The state of California has many school districts with varying ACT and SAT performance. To improve performance on these standardized tests, it is often not enough simply to increase funding and resources for districts with poorer results without identifying the underlying reasons behind them.
This project analyses student performance on the SAT and ACT against a number of socio-economic indicators for each district. This helps identify the underlying factors that contribute to poor performance, so that the state can take a more targeted approach to allocating funds and resources, and recommend interventions in areas of concern.
Despite being the technological powerhouse of the United States and ranking 5th among states in GDP per capita, California seems to underperform when it comes to the quality of pre-tertiary education. It ranks 40th in the quality of pre-K-12 education (Source) and 41st in the quality of public schools (Source). This is not for lack of education resources either, as the state ranks 20th in education spending (Source). There is therefore an inefficient utilisation of education spending, and an urgency to take a more targeted approach to where and how the money is spent.
The SAT and ACT are standardized tests that many colleges and universities in the United States require for their admissions process. This score is used along with other materials such as grade point average (GPA) and essay responses to determine whether or not a potential student will be accepted to the university.
The SAT has two sections: Evidence-Based Reading and Writing, and Math (source). The ACT has four sections: English, Mathematics, Reading, and Science, plus an optional Writing section (source). The two tests use different score ranges (the SAT total is scored from 400 to 1600, while the ACT composite is scored from 1 to 36), which are detailed on the tests' respective websites.
Since the 1940s, an increasing number of colleges have been using students' scores on tests like the SAT and the ACT as a measure of college readiness and aptitude (source). Although the Covid-19 pandemic prompted some US universities to make the ACT or SAT optional for the 2020 admission cycle, the tests are making a comeback as normality resumes.
Going to college is still an important engine for social mobility in the US, and attending university is still correlated with higher earnings (Source). Given this, the ACT and SAT remain important benchmarks for the education system.
To fulfill the objective of our analysis, we will use the following datasets:
Datasets provided as part of project
- act_2019_ca.csv: 2019 ACT Scores in California by School
- sat_2019_ca.csv: 2019 SAT Scores in California by School

Datasets from other sources

- california_school_district_info.csv: 2018-2019 California School District information such as NCES ID (for referencing school districts on the NCES database), homelessness, dropout, suspension and other demographic/academic info
- california_school_district_NCES_info.csv: additional socio-economic-demographic information about California school districts obtained from the National Center for Education Statistics (NCES) database
See the notebook for obtaining the data here: additional_data_scraping.ipynb
You can view a sample of a district data page on NCES: Link
# Code:
# calculate mean by dividing sum of the numbers of a list by the length of the list
def calculate_mean(nums):
    return sum(nums) / len(nums)

calculate_mean([1, 2, 3])
2.0
Manually calculate standard deviation:
The formula for standard deviation is below:
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$
Where $x_i$ represents each value in the dataset, $\mu$ represents the mean of all values in the dataset and $n$ represents the number of values in the dataset.
# Code:
# Calculate standard deviation using the formula
def standard_deviation(nums):
    mean = calculate_mean(nums)
    return (sum((x - mean) ** 2 for x in nums) / len(nums)) ** 0.5

standard_deviation([1, 2, 3])
0.816496580927726
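As a quick sanity check (an addition here, not part of the original analysis), the hand-rolled function should agree with NumPy's population standard deviation, which also divides by $n$ (i.e. `ddof=0`):

```python
import numpy as np

def calculate_mean(nums):
    # mean = sum of the values divided by the number of values
    return sum(nums) / len(nums)

def standard_deviation(nums):
    # population standard deviation (divide by n, not n - 1)
    mean = calculate_mean(nums)
    return (sum((x - mean) ** 2 for x in nums) / len(nums)) ** 0.5

sample = [1, 2, 3, 4, 5]
assert abs(standard_deviation(sample) - np.std(sample)) < 1e-12
```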
Data cleaning function:
A function that takes in a string that is a number and a percent symbol (ex. '50%', '30.5%', etc.) and converts this to a float that is the decimal approximation of the percent. For example, inputting '50%' in your function should return 0.5, '30.5%' should return 0.305, etc. Make sure to test your function to make sure it works!
# Code: Function to convert percent to decimal approximation of the percent
def decimal_approximation(percentage):
    return float(percentage.replace("%", "")) / 100 if isinstance(percentage, str) else percentage / 100

print(decimal_approximation("12.23%"))
print(decimal_approximation(12.23))
print(decimal_approximation(12))
0.1223
0.1223
0.12
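In the dataframes below, such percent strings appear in whole columns, so the function would typically be applied element-wise; a minimal sketch (the Series here is illustrative, not from the actual dataset):

```python
import pandas as pd

def decimal_approximation(percentage):
    # strip the "%" from strings, then divide by 100
    if isinstance(percentage, str):
        return float(percentage.replace("%", "")) / 100
    return percentage / 100

# a mixed column of percent strings and plain numbers
s = pd.Series(["50%", "30.5%", 12.23])
print(s.map(decimal_approximation))
```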
All libraries used are imported here.
# Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import geopandas as gpd
warnings.filterwarnings('ignore')
sns.set_theme(style = 'white')
# First, we will read the ACT and SAT results for schools in California
ACT_df = pd.read_csv("../data/act_2019_ca.csv")
SAT_df = pd.read_csv("../data/sat_2019_ca.csv")
# View first few rows for ACT
print(ACT_df.shape)
ACT_df.head()
(2309, 17)
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | Enroll12 | NumTstTakr | AvgScrRead | AvgScrEng | AvgScrMath | AvgScrSci | NumGE21 | PctGE21 | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.366990e+13 | 33 | 3366993 | 129882.0 | S | 21st Century Learning Institute | Beaumont Unified | Riverside | 18 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 |
| 1 | 1.964210e+13 | 19 | 1964212 | 1995596.0 | S | ABC Secondary (Alternative) | ABC Unified | Los Angeles | 58 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 |
| 2 | 1.563780e+13 | 15 | 1563776 | 1530377.0 | S | Abraham Lincoln Alternative | Southern Kern Unified | Kern | 18 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 |
| 3 | 4.369670e+13 | 43 | 4369666 | 4333795.0 | S | Abraham Lincoln High | San Jose Unified | Santa Clara | 463 | 53 | 23 | 22 | 22 | 23 | 34 | 64.15 | 2018-19 |
| 4 | 1.964730e+13 | 19 | 1964733 | 1935121.0 | S | Abraham Lincoln Senior High | Los Angeles Unified | Los Angeles | 226 | 19 | 21 | 20 | 23 | 22 | 11 | 57.89 | 2018-19 |
pd.set_option('display.max_columns', 500)
# View first few rows for SAT
print(SAT_df.shape)
SAT_df.head(5)
(2579, 25)
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | Enroll12 | NumTSTTakr12 | NumERWBenchmark12 | PctERWBenchmark12 | NumMathBenchmark12 | PctMathBenchmark12 | TotNumBothBenchmark12 | PctBothBenchmark12 | Enroll11 | NumTSTTakr11 | NumERWBenchmark11 | PctERWBenchmark11 | NumMathBenchmark11 | PctMathBenchmark11 | TotNumBothBenchmark11 | PctBothBenchmark11 | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.615980e+12 | 6 | 661598 | 630046 | S | Colusa Alternative Home | Colusa Unified | Colusa | 18 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 18 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 |
| 1 | 6.616060e+12 | 6 | 661606 | 634758 | S | Maxwell Sr High | Maxwell Unified | Colusa | 29 | 10 | * | * | * | * | * | * | 26 | 6 | * | * | * | * | * | * | 2018-19 |
| 2 | 1.964730e+13 | 19 | 1964733 | 1930924 | S | Belmont Senior High | Los Angeles Unified | Los Angeles | 206 | 102 | 31 | 30.39 | 14 | 13.73 | 14 | 13.73 | 219 | 174 | 42 | 24.14 | 12 | 6.9 | 11 | 6.32 | 2018-19 |
| 3 | 1.964730e+13 | 19 | 1964733 | 1931476 | S | Canoga Park Senior High | Los Angeles Unified | Los Angeles | 227 | 113 | 54 | 47.79 | 18 | 15.93 | 18 | 15.93 | 333 | 275 | 97 | 35.27 | 37 | 13.45 | 35 | 12.73 | 2018-19 |
| 4 | 1.964730e+13 | 19 | 1964733 | 1931856 | S | Whitman Continuation | Los Angeles Unified | Los Angeles | 18 | 14 | * | * | * | * | * | * | 17 | 5 | * | * | * | * | * | * | 2018-19 |
From the dataframes, we can see that some values are displayed as NaN, some (such as "NumTstTakr") are 0, while others are displayed as *.
To summarise: as we are interested in district-level analysis, we will group the data by education district. Cleaning up missing values is therefore less critical, since the values will be aggregated by district. For example, we may sum the total students enrolled within a district, or take the mean of the available test scores in the district.
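The planned aggregation can be sketched on a toy frame (the district labels and column names here are illustrative, not the dataset's):

```python
import numpy as np
import pandas as pd

# toy school-level records; NaN marks a school with no reported score
schools = pd.DataFrame({
    "district": ["A", "A", "B", "B"],
    "enrolled": [100, 150, 80, 120],
    "avg_score": [21.0, np.nan, 19.0, 23.0],
})

# sum enrolment per district; mean() skips NaN scores by default
by_district = schools.groupby("district").agg(
    total_enrolled=("enrolled", "sum"),
    mean_score=("avg_score", "mean"),
)
print(by_district)
```

Because `mean()` skips NaN by default, a district keeps a score as long as at least one of its schools reported one.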
Nonetheless, we will first visualize the missing values.
# We will first convert all * to NaN, so that it is easier to tabulate the missing values
# For ACT
ACT_df = ACT_df.replace("*", np.nan)
# For SAT
SAT_df = SAT_df.replace("*", np.nan)
# For ACT
# Create a bar chart showing how many missing values are in each column
plt.figure(figsize = (20, 5))
null_bar = plt.bar(x = ACT_df.isnull().sum().index, height = ACT_df.isnull().sum())
plt.xlabel("Column Name")
plt.ylabel("Number of missing values")
plt.xticks(rotation = 45)
plt.title("ACT Missing Values")
plt.bar_label(null_bar);
# For SAT
# Create a bar chart showing how many missing values are in each column
plt.figure(figsize = (20, 5))
null_bar = plt.bar(x = SAT_df.isnull().sum().index, height = SAT_df.isnull().sum())
plt.xlabel("Column Name")
plt.ylabel("Number of missing values")
plt.title("SAT Missing Values")
plt.xticks(rotation = 45)
plt.bar_label(null_bar);
Observations from missing data
The missing values mainly come from school names and test scores.
The missing test-score values could be due to a lack of information, or because there were no test takers in that year. We will not delete these rows, as we are mainly interested in the mean scores of each district.
There are also some missing values in the district name ("DName") column. However, we have full information for district codes, so we may be able to obtain the missing names by cross-referencing other data sources.
For now, we will convert the data types from object to string/float so that we can aggregate the data by district.
# First we will see the data types for both df
print("ACT before conversion:\n", ACT_df.dtypes)
print("SAT before conversion:\n", SAT_df.dtypes)
ACT before conversion:
CDS float64
CCode int64
CDCode int64
SCode float64
RType object
SName object
DName object
CName object
Enroll12 int64
NumTstTakr int64
AvgScrRead object
AvgScrEng object
AvgScrMath object
AvgScrSci object
NumGE21 object
PctGE21 object
Year object
dtype: object
SAT before conversion:
CDS float64
CCode int64
CDCode int64
SCode int64
RType object
SName object
DName object
CName object
Enroll12 int64
NumTSTTakr12 int64
NumERWBenchmark12 object
PctERWBenchmark12 object
NumMathBenchmark12 object
PctMathBenchmark12 object
TotNumBothBenchmark12 object
PctBothBenchmark12 object
Enroll11 int64
NumTSTTakr11 int64
NumERWBenchmark11 object
PctERWBenchmark11 object
NumMathBenchmark11 object
PctMathBenchmark11 object
TotNumBothBenchmark11 object
PctBothBenchmark11 object
Year object
dtype: object
# Convert dtype for test scores and benchmarks
# ACT_df columns to convert
ACT_to_convert_col_names = ACT_df.columns[10:16]
#print("ACT_df columns to convert:\n", ACT_to_convert_col_names)
# SAT_df columns to convert
SAT_to_convert_col_names = SAT_df.columns[10:16].append(SAT_df.columns[18:-1])
#print("SAT_df columns to convert:\n", SAT_to_convert_col_names)
# Conversion
ACT_df[ACT_to_convert_col_names] = ACT_df[ACT_to_convert_col_names].astype("float")
SAT_df[SAT_to_convert_col_names] = SAT_df[SAT_to_convert_col_names].astype("float")
print("ACT after conversion:\n", ACT_df.dtypes)
print("SAT after conversion:\n", SAT_df.dtypes)
ACT after conversion:
CDS float64
CCode int64
CDCode int64
SCode float64
RType object
SName object
DName object
CName object
Enroll12 int64
NumTstTakr int64
AvgScrRead float64
AvgScrEng float64
AvgScrMath float64
AvgScrSci float64
NumGE21 float64
PctGE21 float64
Year object
dtype: object
SAT after conversion:
CDS float64
CCode int64
CDCode int64
SCode int64
RType object
SName object
DName object
CName object
Enroll12 int64
NumTSTTakr12 int64
NumERWBenchmark12 float64
PctERWBenchmark12 float64
NumMathBenchmark12 float64
PctMathBenchmark12 float64
TotNumBothBenchmark12 float64
PctBothBenchmark12 float64
Enroll11 int64
NumTSTTakr11 int64
NumERWBenchmark11 float64
PctERWBenchmark11 float64
NumMathBenchmark11 float64
PctMathBenchmark11 float64
TotNumBothBenchmark11 float64
PctBothBenchmark11 float64
Year object
dtype: object
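As an aside, the `*`-to-NaN replacement and the float conversion above could be collapsed into one step with `pd.to_numeric(errors="coerce")`, which turns any non-numeric token into NaN; a minimal sketch on an illustrative Series:

```python
import pandas as pd

s = pd.Series(["21", "*", "18.5", None])

# non-numeric tokens ("*", None) become NaN; the result dtype is float64
converted = pd.to_numeric(s, errors="coerce")
print(converted)
```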
SAT_df.isnull().sum()
CDS 0
CCode 0
CDCode 0
SCode 0
RType 0
SName 597
DName 58
CName 0
Enroll12 0
NumTSTTakr12 0
NumERWBenchmark12 857
PctERWBenchmark12 857
NumMathBenchmark12 857
PctMathBenchmark12 857
TotNumBothBenchmark12 857
PctBothBenchmark12 857
Enroll11 0
NumTSTTakr11 0
NumERWBenchmark11 843
PctERWBenchmark11 843
NumMathBenchmark11 843
PctMathBenchmark11 843
TotNumBothBenchmark11 843
PctBothBenchmark11 843
Year 0
dtype: int64
We will now extract the district-level data, which is indicated by "D" in the "RType" column.
# For ACT
ACT_district_df = ACT_df.loc[ACT_df["RType"] == "D"].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
ACT_district_df.head()
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | Enroll12 | NumTstTakr | AvgScrRead | AvgScrEng | AvgScrMath | AvgScrSci | NumGE21 | PctGE21 | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1122 | 1.100170e+12 | 1 | 110017 | NaN | D | NaN | Alameda County Office of Education | Alameda | 170 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 |
| 1123 | 1.611190e+12 | 1 | 161119 | NaN | D | NaN | Alameda Unified | Alameda | 919 | 155 | 27.0 | 26.0 | 26.0 | 25.0 | 134.0 | 86.45 | 2018-19 |
| 1124 | 1.611270e+12 | 1 | 161127 | NaN | D | NaN | Albany City Unified | Alameda | 307 | 58 | 28.0 | 27.0 | 27.0 | 26.0 | 51.0 | 87.93 | 2018-19 |
| 1125 | 1.611430e+12 | 1 | 161143 | NaN | D | NaN | Berkeley Unified | Alameda | 820 | 227 | 26.0 | 25.0 | 25.0 | 24.0 | 166.0 | 73.13 | 2018-19 |
| 1126 | 1.611500e+12 | 1 | 161150 | NaN | D | NaN | Castro Valley Unified | Alameda | 737 | 106 | 26.0 | 25.0 | 25.0 | 25.0 | 82.0 | 77.36 | 2018-19 |
# For SAT
SAT_district_df = SAT_df.loc[SAT_df["RType"] == "D"].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
#SAT_district_df.loc[(~SAT_district_df.PctBothBenchmark11.isnull()) | (~SAT_district_df.NumERWBenchmark12.isnull())]
SAT_district_df.head()
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | Enroll12 | NumTSTTakr12 | NumERWBenchmark12 | PctERWBenchmark12 | NumMathBenchmark12 | PctMathBenchmark12 | TotNumBothBenchmark12 | PctBothBenchmark12 | Enroll11 | NumTSTTakr11 | NumERWBenchmark11 | PctERWBenchmark11 | NumMathBenchmark11 | PctMathBenchmark11 | TotNumBothBenchmark11 | PctBothBenchmark11 | Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2037 | 1.611760e+12 | 1 | 161176 | 0 | D | NaN | Fremont Unified | Alameda | 2537 | 845 | 767.0 | 90.77 | 695.0 | 82.25 | 678.0 | 80.24 | 2555 | 1396 | 1365.0 | 97.78 | 1321.0 | 94.63 | 1312.0 | 93.98 | 2018-19 |
| 2038 | 1.612750e+12 | 1 | 161275 | 0 | D | NaN | Piedmont City Unified | Alameda | 231 | 78 | 68.0 | 87.18 | 62.0 | 79.49 | 61.0 | 78.21 | 208 | 97 | 97.0 | 100.00 | 96.0 | 98.97 | 96.0 | 98.97 | 2018-19 |
| 2039 | 1.612910e+12 | 1 | 161291 | 0 | D | NaN | San Leandro Unified | Alameda | 754 | 193 | 138.0 | 71.50 | 83.0 | 43.01 | 77.0 | 39.90 | 695 | 458 | 239.0 | 52.18 | 140.0 | 30.57 | 122.0 | 26.64 | 2018-19 |
| 2040 | 1.062170e+13 | 10 | 1062166 | 0 | D | NaN | Fresno Unified | Fresno | 4593 | 1048 | 611.0 | 58.30 | 344.0 | 32.82 | 323.0 | 30.82 | 4751 | 3017 | 1508.0 | 49.98 | 723.0 | 23.96 | 681.0 | 22.57 | 2018-19 |
| 2041 | 1.075130e+13 | 10 | 1075127 | 0 | D | NaN | Mendota Unified | Fresno | 234 | 69 | 23.0 | 33.33 | 11.0 | 15.94 | 9.0 | 13.04 | 197 | 76 | 43.0 | 56.58 | 24.0 | 31.58 | 21.0 | 27.63 | 2018-19 |
Next, we will add derived columns to the dataframes, taking note of the following due to missing values:
For ACT: the enrolment and test-taker counts are complete, so the participation rate can be computed directly, and the percentage of above-average scores is converted to a decimal.
For SAT: the grade 11 and grade 12 figures are combined; where the benchmark counts for one grade are missing, only the available grade's figures are used.
# For ACT
ACT_district_df["act_participation_rate"] = ACT_district_df["NumTstTakr"] / ACT_district_df["Enroll12"]
# rename some column names
ACT_district_df.rename(columns = {"Enroll12": "act_enroll",
"NumTstTakr":"act_num_test_taker",
"AvgScrRead":"act_average_reading_score",
"AvgScrEng":"act_average_english_score",
"AvgScrMath":"act_average_math_score",
"AvgScrSci":"act_average_science_score",
"NumGE21":"act_num_above_average_score",
"PctGE21":"act_percentage_above_average_score"
}, inplace = True)
# Convert percentage to decimal approximation
ACT_district_df["act_percentage_above_average_score"] = ACT_district_df["act_percentage_above_average_score"] / 100
ACT_district_df.head()
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | act_enroll | act_num_test_taker | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | Year | act_participation_rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1122 | 1.100170e+12 | 1 | 110017 | NaN | D | NaN | Alameda County Office of Education | Alameda | 170 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 | 0.000000 |
| 1123 | 1.611190e+12 | 1 | 161119 | NaN | D | NaN | Alameda Unified | Alameda | 919 | 155 | 27.0 | 26.0 | 26.0 | 25.0 | 134.0 | 0.8645 | 2018-19 | 0.168662 |
| 1124 | 1.611270e+12 | 1 | 161127 | NaN | D | NaN | Albany City Unified | Alameda | 307 | 58 | 28.0 | 27.0 | 27.0 | 26.0 | 51.0 | 0.8793 | 2018-19 | 0.188925 |
| 1125 | 1.611430e+12 | 1 | 161143 | NaN | D | NaN | Berkeley Unified | Alameda | 820 | 227 | 26.0 | 25.0 | 25.0 | 24.0 | 166.0 | 0.7313 | 2018-19 | 0.276829 |
| 1126 | 1.611500e+12 | 1 | 161150 | NaN | D | NaN | Castro Valley Unified | Alameda | 737 | 106 | 26.0 | 25.0 | 25.0 | 25.0 | 82.0 | 0.7736 | 2018-19 | 0.143826 |
# For SAT
# For enrolment and test taker numbers which do not have missing values
SAT_district_df["sat_enroll"] = SAT_district_df["Enroll12"] + SAT_district_df["Enroll11"]
SAT_district_df["sat_num_test_taker"] = SAT_district_df["NumTSTTakr12"] + SAT_district_df["NumTSTTakr11"]
SAT_district_df["sat_participation_rate"] = SAT_district_df["sat_num_test_taker"] / SAT_district_df["sat_enroll"]
# A function to get combined benchmark values depending on the availability of data for each row
def combined_benchmark(row):
    if np.isnan(row["NumERWBenchmark12"]) and np.isnan(row["NumERWBenchmark11"]):
        # both grade 11 and grade 12 missing
        row["sat_num_erw_benchmark"] = np.nan
        row["sat_percentage_erw_benchmark"] = np.nan
        row["sat_num_math_benchmark"] = np.nan
        row["sat_percentage_math_benchmark"] = np.nan
        row["sat_num_both_benchmark"] = np.nan
        row["sat_percentage_both_benchmark"] = np.nan
    elif not np.isnan(row["NumERWBenchmark12"]) and np.isnan(row["NumERWBenchmark11"]):
        # grade 12 available and grade 11 missing
        row["sat_num_erw_benchmark"] = row["NumERWBenchmark12"]
        row["sat_percentage_erw_benchmark"] = row["sat_num_erw_benchmark"] / row["NumTSTTakr12"]
        row["sat_num_math_benchmark"] = row["NumMathBenchmark12"]
        row["sat_percentage_math_benchmark"] = row["sat_num_math_benchmark"] / row["NumTSTTakr12"]
        row["sat_num_both_benchmark"] = row["TotNumBothBenchmark12"]
        row["sat_percentage_both_benchmark"] = row["sat_num_both_benchmark"] / row["NumTSTTakr12"]
    elif np.isnan(row["NumERWBenchmark12"]) and not np.isnan(row["NumERWBenchmark11"]):
        # grade 12 missing and grade 11 available
        row["sat_num_erw_benchmark"] = row["NumERWBenchmark11"]
        row["sat_percentage_erw_benchmark"] = row["sat_num_erw_benchmark"] / row["NumTSTTakr11"]
        row["sat_num_math_benchmark"] = row["NumMathBenchmark11"]
        row["sat_percentage_math_benchmark"] = row["sat_num_math_benchmark"] / row["NumTSTTakr11"]
        row["sat_num_both_benchmark"] = row["TotNumBothBenchmark11"]
        row["sat_percentage_both_benchmark"] = row["sat_num_both_benchmark"] / row["NumTSTTakr11"]
    else:
        # both grades available
        row["sat_num_erw_benchmark"] = row["NumERWBenchmark12"] + row["NumERWBenchmark11"]
        row["sat_percentage_erw_benchmark"] = row["sat_num_erw_benchmark"] / row["sat_num_test_taker"]
        row["sat_num_math_benchmark"] = row["NumMathBenchmark12"] + row["NumMathBenchmark11"]
        row["sat_percentage_math_benchmark"] = row["sat_num_math_benchmark"] / row["sat_num_test_taker"]
        row["sat_num_both_benchmark"] = row["TotNumBothBenchmark12"] + row["TotNumBothBenchmark11"]
        row["sat_percentage_both_benchmark"] = row["sat_num_both_benchmark"] / row["sat_num_test_taker"]
    return row
SAT_district_df = SAT_district_df.apply(combined_benchmark, axis = 1)
SAT_district_df.tail()
| CDS | CCode | CDCode | SCode | RType | SName | DName | CName | Enroll12 | NumTSTTakr12 | NumERWBenchmark12 | PctERWBenchmark12 | NumMathBenchmark12 | PctMathBenchmark12 | TotNumBothBenchmark12 | PctBothBenchmark12 | Enroll11 | NumTSTTakr11 | NumERWBenchmark11 | PctERWBenchmark11 | NumMathBenchmark11 | PctMathBenchmark11 | TotNumBothBenchmark11 | PctBothBenchmark11 | Year | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2571 | 4.970840e+13 | 49 | 4970839 | 0 | D | NaN | Oak Grove Union Elementary | Sonoma | 70 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | 82 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 | 152 | 2 | 0.013158 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2572 | 5.071090e+13 | 50 | 5071092 | 0 | D | NaN | Hart-Ransom Union Elementary | Stanislaus | 18 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 17 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 | 35 | 0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2573 | 5.071130e+13 | 50 | 5071134 | 0 | D | NaN | Keyes Union | Stanislaus | 25 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 20 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 2018-19 | 45 | 0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2574 | 5.075560e+13 | 50 | 5075556 | 0 | D | NaN | Riverbank Unified | Stanislaus | 179 | 40 | 23.0 | 57.50 | 13.0 | 32.50 | 11.0 | 27.5 | 213 | 78 | 45.0 | 57.69 | 25.0 | 32.05 | 21.0 | 26.92 | 2018-19 | 392 | 118 | 0.301020 | 68.0 | 0.576271 | 38.0 | 0.322034 | 32.0 | 0.271186 |
| 2575 | 5.772680e+13 | 57 | 5772678 | 0 | D | NaN | Davis Joint Unified | Yolo | 717 | 307 | 293.0 | 95.44 | 274.0 | 89.25 | 268.0 | 87.3 | 739 | 474 | 454.0 | 95.78 | 428.0 | 90.30 | 421.0 | 88.82 | 2018-19 | 1456 | 781 | 0.536401 | 747.0 | 0.956466 | 702.0 | 0.898848 | 689.0 | 0.882202 |
Next, we will merge the ACT and SAT datasets into a single dataframe containing both tests' results.
# First, we only require some of the columns, so we will extract these
# For ACT
ACT_district_df_subset = ACT_district_df[["CDCode", "act_enroll", "act_num_test_taker", "act_participation_rate",
"act_average_reading_score","act_average_english_score", "act_average_math_score",
"act_average_science_score", "act_num_above_average_score", "act_percentage_above_average_score"]]
# For SAT
SAT_district_df_subset = SAT_district_df[["CDCode", "sat_enroll", "sat_num_test_taker", "sat_participation_rate",
"sat_num_erw_benchmark","sat_percentage_erw_benchmark",
"sat_num_math_benchmark", "sat_percentage_math_benchmark",
"sat_num_both_benchmark", "sat_percentage_both_benchmark"]]
# We will add a column with the average composite score (the mean of the four section averages)
ACT_district_df_subset['act_average_composite_score'] = (ACT_district_df_subset['act_average_reading_score'] + ACT_district_df_subset['act_average_english_score'] + ACT_district_df_subset['act_average_math_score'] + ACT_district_df_subset['act_average_science_score']) / 4
print(ACT_district_df_subset.shape)
ACT_district_df_subset.head()
(522, 11)
| CDCode | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1122 | 110017 | 170 | 0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1123 | 161119 | 919 | 155 | 0.168662 | 27.0 | 26.0 | 26.0 | 25.0 | 134.0 | 0.8645 | 26.00 |
| 1124 | 161127 | 307 | 58 | 0.188925 | 28.0 | 27.0 | 27.0 | 26.0 | 51.0 | 0.8793 | 27.00 |
| 1125 | 161143 | 820 | 227 | 0.276829 | 26.0 | 25.0 | 25.0 | 24.0 | 166.0 | 0.7313 | 25.00 |
| 1126 | 161150 | 737 | 106 | 0.143826 | 26.0 | 25.0 | 25.0 | 25.0 | 82.0 | 0.7736 | 25.25 |
print(SAT_district_df_subset.shape)
SAT_district_df_subset.head()
(539, 10)
| CDCode | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | |
|---|---|---|---|---|---|---|---|---|---|---|
| 2037 | 161176 | 5092 | 2241 | 0.440102 | 2132.0 | 0.951361 | 2016.0 | 0.899598 | 1990.0 | 0.887996 |
| 2038 | 161275 | 439 | 175 | 0.398633 | 165.0 | 0.942857 | 158.0 | 0.902857 | 157.0 | 0.897143 |
| 2039 | 161291 | 1449 | 651 | 0.449275 | 377.0 | 0.579109 | 223.0 | 0.342550 | 199.0 | 0.305684 |
| 2040 | 1062166 | 9344 | 4065 | 0.435039 | 2119.0 | 0.521279 | 1067.0 | 0.262485 | 1004.0 | 0.246986 |
| 2041 | 1075127 | 431 | 145 | 0.336427 | 66.0 | 0.455172 | 35.0 | 0.241379 | 30.0 | 0.206897 |
# Next we will merge both test result subsets
test_results_combined = ACT_district_df_subset.merge(right = SAT_district_df_subset, how = "outer", on = "CDCode")
print(test_results_combined.shape)
test_results_combined.head()
(539, 20)
| CDCode | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 110017 | 170.0 | 0.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 345 | 143 | 0.414493 | 62.0 | 0.433566 | 28.0 | 0.195804 | 24.0 | 0.167832 |
| 1 | 161119 | 919.0 | 155.0 | 0.168662 | 27.0 | 26.0 | 26.0 | 25.0 | 134.0 | 0.8645 | 26.00 | 1790 | 812 | 0.453631 | 677.0 | 0.833744 | 598.0 | 0.736453 | 570.0 | 0.701970 |
| 2 | 161127 | 307.0 | 58.0 | 0.188925 | 28.0 | 27.0 | 27.0 | 26.0 | 51.0 | 0.8793 | 27.00 | 569 | 261 | 0.458699 | 254.0 | 0.973180 | 238.0 | 0.911877 | 237.0 | 0.908046 |
| 3 | 161143 | 820.0 | 227.0 | 0.276829 | 26.0 | 25.0 | 25.0 | 24.0 | 166.0 | 0.7313 | 25.00 | 1691 | 704 | 0.416322 | 621.0 | 0.882102 | 525.0 | 0.745739 | 518.0 | 0.735795 |
| 4 | 161150 | 737.0 | 106.0 | 0.143826 | 26.0 | 25.0 | 25.0 | 25.0 | 82.0 | 0.7736 | 25.25 | 1496 | 661 | 0.441845 | 606.0 | 0.916793 | 545.0 | 0.824508 | 527.0 | 0.797277 |
# And see the number of missing values
test_results_combined.isnull().sum()
CDCode 0
act_enroll 17
act_num_test_taker 17
act_participation_rate 17
act_average_reading_score 189
act_average_english_score 189
act_average_math_score 189
act_average_science_score 189
act_num_above_average_score 189
act_percentage_above_average_score 189
act_average_composite_score 189
sat_enroll 0
sat_num_test_taker 0
sat_participation_rate 0
sat_num_erw_benchmark 124
sat_percentage_erw_benchmark 124
sat_num_math_benchmark 124
sat_percentage_math_benchmark 124
sat_num_both_benchmark 124
sat_percentage_both_benchmark 124
dtype: int64
It can be seen that there are 17 districts with SAT results but no ACT results (539 vs 522 rows before the merge).
The ACT columns also have more missing score values than the SAT columns.
Let us see how many records there are with missing scores/benchmarks for both tests.
# number of rows with missing score/benchmark values for both ACT and SAT
test_results_combined[(test_results_combined["act_average_reading_score"].isnull()) & (test_results_combined["sat_num_erw_benchmark"].isnull())].shape
(119, 20)
There are 119 districts with missing score/benchmark values for both tests. For now, we will not remove these as we can still get the participation rate from these records.
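If we later want to know exactly which districts sit on only one side of the outer merge, `merge` accepts `indicator=True`, which labels each row as `left_only`, `right_only`, or `both`; a sketch on toy frames (the values are illustrative):

```python
import pandas as pd

act = pd.DataFrame({"CDCode": [1, 2, 3], "act_score": [21, 24, 19]})
sat = pd.DataFrame({"CDCode": [2, 3, 4], "sat_pct": [0.6, 0.7, 0.5]})

merged = act.merge(sat, how="outer", on="CDCode", indicator=True)

# districts present only in the SAT frame are flagged "right_only"
sat_only = merged.loc[merged["_merge"] == "right_only", "CDCode"]
print(sat_only.tolist())  # [4]
```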
test_results_combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 539 entries, 0 to 538
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   CDCode                              539 non-null    int64
 1   act_enroll                          522 non-null    float64
 2   act_num_test_taker                  522 non-null    float64
 3   act_participation_rate              522 non-null    float64
 4   act_average_reading_score           350 non-null    float64
 5   act_average_english_score           350 non-null    float64
 6   act_average_math_score              350 non-null    float64
 7   act_average_science_score           350 non-null    float64
 8   act_num_above_average_score         350 non-null    float64
 9   act_percentage_above_average_score  350 non-null    float64
 10  act_average_composite_score         350 non-null    float64
 11  sat_enroll                          539 non-null    int64
 12  sat_num_test_taker                  539 non-null    int64
 13  sat_participation_rate              539 non-null    float64
 14  sat_num_erw_benchmark               415 non-null    float64
 15  sat_percentage_erw_benchmark        415 non-null    float64
 16  sat_num_math_benchmark              415 non-null    float64
 17  sat_percentage_math_benchmark       415 non-null    float64
 18  sat_num_both_benchmark              415 non-null    float64
 19  sat_percentage_both_benchmark       415 non-null    float64
dtypes: float64(17), int64(3)
memory usage: 88.4 KB
The data types of the variables also look to be what we want them to be.
We will now attempt to process additional datasets about the socio-economic-demographic condition of the districts. This is so that we can draw deeper insights regarding what might have resulted in the poor results in some districts.
school_district_info_df = pd.read_csv("../data/california_school_district_info.csv")
school_district_nces_info_df = pd.read_csv("../data/california_school_district_NCES_info.csv")
print(school_district_info_df.shape)
school_district_info_df.head()
(944, 69)
| OBJECTID | Year | FedID | CDCode | CDSCode | CountyName | DistrictNa | DistrictTy | GradeLow | GradeHigh | GradeLowCe | GradeHighC | AssistStat | CongressUS | SenateCA | AssemblyCA | LocaleDist | UpdateNote | EnrollTota | EnrollChar | EnrollNonC | AAcount | AApct | AIcount | AIpct | AScount | ASpct | FIcount | FIpct | HIcount | HIpct | MRcount | MRpct | PIcount | PIpct | WHcount | WHpct | NRcount | NRpct | ELcount | ELpct | FOScount | FOSpct | HOMcount | HOMpct | MIGcount | MIGpct | SWDcount | SWDpct | SEDcount | SEDpct | UPcount | UPpct | ELATested | ELAStdMetP | MathTested | MathStdMet | CCPrepCoho | CCPrepPct | AbsentElig | AbsentPct | GradCohort | GradPct | UCCSUReqMe | DropOutPct | EnrollCumu | SuspPct | SHAPE_Leng | SHAPE_Area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2018-19 | 601770 | 161119 | 1.611190e+12 | Alameda | Alameda Unified | Unified | KG | 12 | KG | 12 | Differentiated Assistance | 13 | 9 | 18 | 21 - Suburban, Large | 2018-19 District Boundary Adjustment/Correction | 11260 | 1877 | 9383 | 825 | 7.300000 | 28 | 0.2 | 2854 | 25.299999 | 683 | 6.1 | 1960 | 17.400000 | 1310 | 11.6 | 67 | 0.6 | 3244 | 28.799999 | 289 | 2.6 | 1601 | 14.2 | 26 | 0.2 | 67 | 0.6 | 0 | 0.0 | 1289 | 11.4 | 3523 | 31.299999 | 3144 | 33.500000 | 4567.0 | 69.400002 | 4587.0 | 59.700001 | 773.0 | 56.799999 | 11453.0 | 9.4 | 882.0 | 88.900002 | 58.900002 | 5.2 | 11600.0 | 2.6 | 57196.40394 | 4.733304e+07 |
| 1 | 2 | 2018-19 | 601860 | 161127 | 1.611270e+12 | Alameda | Albany City Unified | Unified | KG | 12 | KG | 12 | General Assistance | 13 | 9 | 15 | 21 - Suburban, Large | 2018-19 District Boundary Adjustment/Correction | 3682 | 0 | 3682 | 111 | 3.000000 | 4 | 0.1 | 1063 | 28.900000 | 47 | 1.3 | 592 | 16.100000 | 510 | 13.9 | 6 | 0.2 | 1163 | 31.600000 | 186 | 5.1 | 583 | 15.8 | 2 | 0.1 | 12 | 0.3 | 0 | 0.0 | 312 | 8.5 | 777 | 21.100000 | 1118 | 30.400000 | 1773.0 | 79.199997 | 1795.0 | 72.000000 | 302.0 | 62.599998 | 3763.0 | 6.4 | 301.0 | 94.699997 | 63.900002 | 1.7 | 3830.0 | 1.9 | 12870.77216 | 7.161276e+06 |
| 2 | 3 | 2018-19 | 604740 | 161143 | 1.611430e+12 | Alameda | Berkeley Unified | Unified | KG | 12 | KG | 12 | Differentiated Assistance | 13 | 9 | 15 | 12 - City, Midsize | 2018-19 District Boundary Adjustment/Correction | 10194 | 383 | 9811 | 1440 | 14.100000 | 16 | 0.2 | 731 | 7.200000 | 83 | 0.8 | 2438 | 23.900000 | 1465 | 14.4 | 14 | 0.1 | 3999 | 39.200001 | 8 | 0.1 | 953 | 9.3 | 23 | 0.2 | 265 | 2.6 | 0 | 0.0 | 1148 | 11.3 | 3331 | 32.700001 | 3275 | 33.400002 | 4614.0 | 66.699997 | 4549.0 | 61.700001 | 790.0 | 63.299999 | 10404.0 | 9.7 | 856.0 | 87.400002 | 83.300003 | 10.7 | 10534.0 | 1.4 | 44426.28741 | 4.376389e+07 |
| 3 | 4 | 2018-19 | 607800 | 161150 | 1.611500e+12 | Alameda | Castro Valley Unified | Unified | KG | 12 | KG | 12 | General Assistance | 13,15 | 07,09,10 | 18,20 | 21 - Suburban, Large | 2018-19 District Boundary Adjustment/Correction | 9269 | 0 | 9269 | 489 | 5.300000 | 19 | 0.2 | 2539 | 27.400000 | 420 | 4.5 | 2204 | 23.799999 | 909 | 9.8 | 47 | 0.5 | 2480 | 26.799999 | 162 | 1.7 | 1021 | 11.0 | 12 | 0.1 | 68 | 0.7 | 0 | 0.0 | 800 | 8.6 | 2284 | 24.600000 | 2739 | 29.600000 | 4844.0 | 68.099998 | 4850.0 | 63.200001 | 737.0 | 57.099998 | 9433.0 | 7.4 | 742.0 | 92.599998 | 55.900002 | 3.0 | 9532.0 | 2.2 | 142338.49410 | 2.833205e+08 |
| 4 | 5 | 2018-19 | 612630 | 161168 | 1.611680e+12 | Alameda | Emery Unified | Unified | KG | 12 | KG | 12 | Differentiated Assistance | 13 | 9 | 15 | 21 - Suburban, Large | 2018-19 District Boundary Adjustment/Correction | 743 | 0 | 743 | 326 | 43.900002 | 1 | 0.1 | 83 | 11.200000 | 11 | 1.5 | 185 | 24.900000 | 62 | 8.3 | 8 | 1.1 | 67 | 9.000000 | 0 | 0.0 | 160 | 21.5 | 1 | 0.1 | 7 | 0.9 | 0 | 0.0 | 55 | 7.4 | 600 | 80.800003 | 613 | 83.099998 | 376.0 | 37.099998 | 377.0 | 23.299999 | 40.0 | 22.500000 | 760.0 | 20.5 | 40.0 | 100.000000 | 40.000000 | 0.0 | 780.0 | 3.6 | 13762.86929 | 5.368212e+06 |
print(school_district_nces_info_df.shape)
school_district_nces_info_df.head()
(944, 33)
| Unnamed: 0 | NCES_ID | district_name | district_type | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | district_urban_type | expenditure_per_student | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 601770 | Alameda Unified | Unified | 78,522 | $104,756 | 30,418 | 43% | 7% | 13% | 30% | 0% | 0% | 1% | 6% | 6.60% | 31.30% | 62.20% | 90.40% | 52.90% | 47.30% | 71.90% | 5.10% | 98.20% | 9.80% | 71% | 5% | 19% | 5% | 13.6 | 61.2 | Suburb: Large (21) | 14325.0 |
| 1 | 1 | 601860 | Albany City Unified | Unified | 19,862 | $95,783 | 7,467 | 44% | 3% | 15% | 29% | 1% | 0% | 2% | 6% | 13% | 23.30% | 63.60% | 83.80% | 52.60% | 47.30% | 66.20% | 1.90% | 99.40% | 6.50% | 75% | 5% | 16% | 4% | 15.5 | 79.2 | Suburb: Large (21) | 21087.0 |
| 2 | 2 | 604740 | Berkeley Unified | Unified | 121,444 | $85,444 | 45,337 | 53% | 8% | 11% | 21% | 0% | 0% | 0% | 6% | 5.30% | 16.80% | 77.90% | 91.20% | 45.80% | 54.20% | 80.90% | 3.10% | 98.70% | 4.30% | 68% | 5% | 21% | 6% | 11.8 | 74.3 | City: Midsize (12) | 17532.0 |
| 3 | 3 | 607800 | Castro Valley Unified | Unified | 50,946 | $114,277 | 17,666 | 47% | 5% | 13% | 30% | 1% | 0% | 0% | 5% | 5.30% | 39.20% | 55.50% | 92% | 84% | 16.10% | 71.80% | 4% | 98.20% | 7.90% | 75% | 5% | 16% | 5% | 17.6 | 57.3 | Suburb: Large (21) | 16312.0 |
| 4 | 4 | 612630 | Emery Unified | Unified | 11,899 | $102,725 | 6,568 | 40% | 15% | 10% | 29% | 0% | 1% | 0% | 6% | 33.20% | 49.80% | 16.90% | 91.20% | 13% | 87% | 67.50% | 0% | 100% | 22.40% | 47% | 3% | 47% | 3% | 17.6 | 56.5 | Suburb: Large (21) | 23930.0 |
We will add just two more columns: the percentages of students enrolled in charter and non-charter schools.
# Percentages of total enrolment in charter vs non-charter schools
school_district_info_df["charter_school_percentage"] = school_district_info_df["EnrollChar"] / school_district_info_df["EnrollTota"] * 100
school_district_info_df["non_charter_school_percentage"] = school_district_info_df["EnrollNonC"] / school_district_info_df["EnrollTota"] * 100
As both dataframes cover some of the same categories of information, we will extract a subset of features from each before merging them.
# Extract subset from "school_district_info_df"
cols_school_district_info_df = ["FedID","CDCode","CountyName","DistrictNa","DistrictTy","LocaleDist","EnrollTota","charter_school_percentage","non_charter_school_percentage","HOMpct","MIGpct","DropOutPct","SuspPct"]
school_district_info_df_subset = school_district_info_df[cols_school_district_info_df]
# rename column names
school_district_info_df_subset.rename({"FedID": "NCES_ID", "CountyName": "county_name", "DistrictNa": "district_name",
"DistrictTy": "district_type", "LocaleDist": "urban_locale", "EnrollTota": "total_enrolment",
"HOMpct": "homeless_student_percentage", "MIGpct": "migrant_student_percentage",
"DropOutPct": "dropout_percentage", "SuspPct": "suspension_percentage"}, inplace = True, axis = 1)
# Extract subset from "school_district_nces_info_df"
school_district_nces_info_df_subset = school_district_nces_info_df.drop(["district_name", "district_type", "district_urban_type"], axis = 1)
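One caveat with this subset-then-rename pattern: calling `rename(..., inplace=True)` on a frame produced by column indexing can trigger pandas' `SettingWithCopyWarning`. Taking an explicit `.copy()` first sidesteps this. A minimal sketch with hypothetical toy data standing in for `school_district_info_df`:

```python
import pandas as pd

# Toy frame standing in for school_district_info_df (hypothetical values)
df = pd.DataFrame({"FedID": [601770], "CountyName": ["Alameda"], "EnrollTota": [11260]})

# Take an explicit copy of the subset, then rename on the copy
subset = df[["FedID", "CountyName"]].copy()
subset = subset.rename(columns={"FedID": "NCES_ID", "CountyName": "county_name"})
```
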
print(school_district_info_df_subset.shape)
school_district_info_df_subset.head()
(944, 13)
| NCES_ID | CDCode | county_name | district_name | district_type | urban_locale | total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 601770 | 161119 | Alameda | Alameda Unified | Unified | 21 - Suburban, Large | 11260 | 16.669627 | 83.330373 | 0.6 | 0.0 | 5.2 | 2.6 |
| 1 | 601860 | 161127 | Alameda | Albany City Unified | Unified | 21 - Suburban, Large | 3682 | 0.000000 | 100.000000 | 0.3 | 0.0 | 1.7 | 1.9 |
| 2 | 604740 | 161143 | Alameda | Berkeley Unified | Unified | 12 - City, Midsize | 10194 | 3.757112 | 96.242888 | 2.6 | 0.0 | 10.7 | 1.4 |
| 3 | 607800 | 161150 | Alameda | Castro Valley Unified | Unified | 21 - Suburban, Large | 9269 | 0.000000 | 100.000000 | 0.7 | 0.0 | 3.0 | 2.2 |
| 4 | 612630 | 161168 | Alameda | Emery Unified | Unified | 21 - Suburban, Large | 743 | 0.000000 | 100.000000 | 0.9 | 0.0 | 0.0 | 3.6 |
print(school_district_nces_info_df_subset.shape)
school_district_nces_info_df_subset.head()
(944, 30)
| Unnamed: 0 | NCES_ID | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 601770 | 78,522 | $104,756 | 30,418 | 43% | 7% | 13% | 30% | 0% | 0% | 1% | 6% | 6.60% | 31.30% | 62.20% | 90.40% | 52.90% | 47.30% | 71.90% | 5.10% | 98.20% | 9.80% | 71% | 5% | 19% | 5% | 13.6 | 61.2 | 14325.0 |
| 1 | 1 | 601860 | 19,862 | $95,783 | 7,467 | 44% | 3% | 15% | 29% | 1% | 0% | 2% | 6% | 13% | 23.30% | 63.60% | 83.80% | 52.60% | 47.30% | 66.20% | 1.90% | 99.40% | 6.50% | 75% | 5% | 16% | 4% | 15.5 | 79.2 | 21087.0 |
| 2 | 2 | 604740 | 121,444 | $85,444 | 45,337 | 53% | 8% | 11% | 21% | 0% | 0% | 0% | 6% | 5.30% | 16.80% | 77.90% | 91.20% | 45.80% | 54.20% | 80.90% | 3.10% | 98.70% | 4.30% | 68% | 5% | 21% | 6% | 11.8 | 74.3 | 17532.0 |
| 3 | 3 | 607800 | 50,946 | $114,277 | 17,666 | 47% | 5% | 13% | 30% | 1% | 0% | 0% | 5% | 5.30% | 39.20% | 55.50% | 92% | 84% | 16.10% | 71.80% | 4% | 98.20% | 7.90% | 75% | 5% | 16% | 5% | 17.6 | 57.3 | 16312.0 |
| 4 | 4 | 612630 | 11,899 | $102,725 | 6,568 | 40% | 15% | 10% | 29% | 0% | 1% | 0% | 6% | 33.20% | 49.80% | 16.90% | 91.20% | 13% | 87% | 67.50% | 0% | 100% | 22.40% | 47% | 3% | 47% | 3% | 17.6 | 56.5 | 23930.0 |
We will then merge the subsets of both dataframes, using NCES_ID as the key.
school_district_full_info_df = school_district_info_df_subset.merge(right = school_district_nces_info_df_subset, on = "NCES_ID", how = "outer")
print(school_district_full_info_df.shape)
school_district_full_info_df.head()
(944, 42)
| NCES_ID | CDCode | county_name | district_name | district_type | urban_locale | total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | Unnamed: 0 | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 601770 | 161119 | Alameda | Alameda Unified | Unified | 21 - Suburban, Large | 11260 | 16.669627 | 83.330373 | 0.6 | 0.0 | 5.2 | 2.6 | 0 | 78,522 | $104,756 | 30,418 | 43% | 7% | 13% | 30% | 0% | 0% | 1% | 6% | 6.60% | 31.30% | 62.20% | 90.40% | 52.90% | 47.30% | 71.90% | 5.10% | 98.20% | 9.80% | 71% | 5% | 19% | 5% | 13.6 | 61.2 | 14325.0 |
| 1 | 601860 | 161127 | Alameda | Albany City Unified | Unified | 21 - Suburban, Large | 3682 | 0.000000 | 100.000000 | 0.3 | 0.0 | 1.7 | 1.9 | 1 | 19,862 | $95,783 | 7,467 | 44% | 3% | 15% | 29% | 1% | 0% | 2% | 6% | 13% | 23.30% | 63.60% | 83.80% | 52.60% | 47.30% | 66.20% | 1.90% | 99.40% | 6.50% | 75% | 5% | 16% | 4% | 15.5 | 79.2 | 21087.0 |
| 2 | 604740 | 161143 | Alameda | Berkeley Unified | Unified | 12 - City, Midsize | 10194 | 3.757112 | 96.242888 | 2.6 | 0.0 | 10.7 | 1.4 | 2 | 121,444 | $85,444 | 45,337 | 53% | 8% | 11% | 21% | 0% | 0% | 0% | 6% | 5.30% | 16.80% | 77.90% | 91.20% | 45.80% | 54.20% | 80.90% | 3.10% | 98.70% | 4.30% | 68% | 5% | 21% | 6% | 11.8 | 74.3 | 17532.0 |
| 3 | 607800 | 161150 | Alameda | Castro Valley Unified | Unified | 21 - Suburban, Large | 9269 | 0.000000 | 100.000000 | 0.7 | 0.0 | 3.0 | 2.2 | 3 | 50,946 | $114,277 | 17,666 | 47% | 5% | 13% | 30% | 1% | 0% | 0% | 5% | 5.30% | 39.20% | 55.50% | 92% | 84% | 16.10% | 71.80% | 4% | 98.20% | 7.90% | 75% | 5% | 16% | 5% | 17.6 | 57.3 | 16312.0 |
| 4 | 612630 | 161168 | Alameda | Emery Unified | Unified | 21 - Suburban, Large | 743 | 0.000000 | 100.000000 | 0.9 | 0.0 | 0.0 | 3.6 | 4 | 11,899 | $102,725 | 6,568 | 40% | 15% | 10% | 29% | 0% | 1% | 0% | 6% | 33.20% | 49.80% | 16.90% | 91.20% | 13% | 87% | 67.50% | 0% | 100% | 22.40% | 47% | 3% | 47% | 3% | 17.6 | 56.5 | 23930.0 |
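When merging with `how = "outer"`, it is worth verifying that the keys actually matched on both sides; pandas' `indicator` parameter adds a `_merge` column for exactly this sanity check. A small illustration with made-up district IDs:

```python
import pandas as pd

left = pd.DataFrame({"NCES_ID": [601770, 601860], "dropout_percentage": [5.2, 1.7]})
right = pd.DataFrame({"NCES_ID": [601860, 604740], "median_household_income": [95783, 85444]})

# indicator=True adds a "_merge" column: 'both' for matched keys,
# 'left_only'/'right_only' for keys present on only one side
merged = left.merge(right, on="NCES_ID", how="outer", indicator=True)
print(merged["_merge"].value_counts())
```

If every row shows `both`, the two subsets line up one-to-one on NCES_ID, as expected here since both have 944 rows.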
# The data types of the dataframe
school_district_full_info_df.dtypes
NCES_ID int64 CDCode int64 county_name object district_name object district_type object urban_locale object total_enrolment int64 charter_school_percentage float64 non_charter_school_percentage float64 homeless_student_percentage float64 migrant_student_percentage float64 dropout_percentage float64 suspension_percentage float64 Unnamed: 0 int64 total_population object median_household_income object total_household object white object black object hispanic_or_latino object asian object american_indian/alaskan_native object hawaiian_and_other_pacific_islander object some_other_race_alone object two_or_more_races object housing_structure_built_2000_and_after object housing_structure_built_1970-1999 object housing_structure_built_before_1970 object household_with_broadband_internet object housing_structure_type_house object housing_structure_type_apartment object speak_english_only_children object under18_with_disability object under18_with_health_insurance object family_income_below_poverty object married_couple_household object cohabitating_couple_household object female_householder_household object male_householder_household object parents_not_in_labor_force object bachelors_or_higher object expenditure_per_student float64 dtype: object
For consistency, we will convert all percentages to their decimal equivalents (e.g. 43% becomes 0.43).
# Convert to decimal approximation of percentage
# Column names of those we would like to convert
cols_to_decimal_approx = ["charter_school_percentage", "non_charter_school_percentage", "homeless_student_percentage",
"migrant_student_percentage", "dropout_percentage", "suspension_percentage",
"white", "black", "hispanic_or_latino", "asian", "american_indian/alaskan_native",
"hawaiian_and_other_pacific_islander", "some_other_race_alone", "two_or_more_races",
"housing_structure_built_2000_and_after", "housing_structure_built_1970-1999", "housing_structure_built_before_1970",
"household_with_broadband_internet", "housing_structure_type_house", "housing_structure_type_apartment",
"speak_english_only_children", "under18_with_disability", "under18_with_health_insurance",
"family_income_below_poverty", "married_couple_household", "cohabitating_couple_household",
"female_householder_household", "male_householder_household", "parents_not_in_labor_force", "bachelors_or_higher"]
# function to apply conversion
def convert_to_deci(col):
if col.name in cols_to_decimal_approx: # if column name is in the selected columns to convert
return col.map(decimal_approximation, na_action = 'ignore') # apply decimal_approximation function written earlier
else:
return col # if not, return original column
school_district_full_info_df = school_district_full_info_df.apply(convert_to_deci, axis = 0)
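The `decimal_approximation` function itself was defined earlier in the notebook and is not shown here. Based on how it is used, it must handle both string percentages like `"6.60%"` and plain numeric percentages like `5.2`; a minimal version consistent with that usage might look like:

```python
def decimal_approximation(value):
    """Convert a percentage (string like '43%' or number like 5.2) to a decimal."""
    # Strings such as "43%" or "6.60%": strip the sign, then parse
    if isinstance(value, str):
        value = float(value.rstrip("%"))
    # Numeric percentages such as 5.2 become 0.052
    return value / 100
```
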
school_district_full_info_df.dtypes
NCES_ID int64 CDCode int64 county_name object district_name object district_type object urban_locale object total_enrolment int64 charter_school_percentage float64 non_charter_school_percentage float64 homeless_student_percentage float64 migrant_student_percentage float64 dropout_percentage float64 suspension_percentage float64 Unnamed: 0 int64 total_population object median_household_income object total_household object white float64 black float64 hispanic_or_latino float64 asian float64 american_indian/alaskan_native float64 hawaiian_and_other_pacific_islander float64 some_other_race_alone float64 two_or_more_races float64 housing_structure_built_2000_and_after float64 housing_structure_built_1970-1999 float64 housing_structure_built_before_1970 float64 household_with_broadband_internet float64 housing_structure_type_house float64 housing_structure_type_apartment float64 speak_english_only_children float64 under18_with_disability float64 under18_with_health_insurance float64 family_income_below_poverty float64 married_couple_household float64 cohabitating_couple_household float64 female_householder_household float64 male_householder_household float64 parents_not_in_labor_force float64 bachelors_or_higher float64 expenditure_per_student float64 dtype: object
We will add an additional column to indicate the majority race of each district.
# function to determine majority race
def majority_race(row):
races_list = ["white", "black", "hispanic_or_latino", "asian", "american_indian/alaskan_native", "hawaiian_and_other_pacific_islander",
"some_other_race_alone", "two_or_more_races"]
values_list = [row[race] for race in races_list]
majority_race = races_list[values_list.index(max(values_list))]
row["majority_race"] = majority_race
    if majority_race == "white":
        row["white_majority"] = "yes"
    else:
        row["white_majority"] = "no"
    return row
school_district_full_info_df = school_district_full_info_df.apply(majority_race, axis = 1)
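Row-wise `apply` works, but it is slow on larger frames; since the race columns are numeric, an equivalent and typically faster approach uses `DataFrame.idxmax`, which returns the column name of the row-wise maximum. A sketch on toy data (hypothetical values, shortened race list):

```python
import pandas as pd

races = ["white", "black", "hispanic_or_latino", "asian"]
df = pd.DataFrame({"white": [0.43, 0.20],
                   "black": [0.07, 0.10],
                   "hispanic_or_latino": [0.13, 0.15],
                   "asian": [0.30, 0.55]})

# idxmax(axis=1) gives the column label of each row's largest value
df["majority_race"] = df[races].idxmax(axis=1)
df["white_majority"] = (df["majority_race"] == "white").map({True: "yes", False: "no"})
```
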
# Merge
school_district_merge_full = school_district_full_info_df.merge(right = test_results_combined, on = "CDCode", how = "left")
# Clean up some of the columns and change data type
school_district_merge_full["total_population"] = school_district_merge_full["total_population"].str.replace(",", "", regex=False).astype("float")
school_district_merge_full["median_household_income"] = school_district_merge_full["median_household_income"].str.replace(",", "", regex=False).str.replace("$", "", regex=False).astype("float")
school_district_merge_full["total_household"] = school_district_merge_full["total_household"].str.replace(",", "", regex=False).astype("float")
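One pitfall with `str.replace` here: on pandas versions before 1.4 the pattern is treated as a regular expression by default, and in regex `"$"` is an end-of-string anchor, so it would strip nothing. Passing `regex=False` forces a literal replacement on any version:

```python
import pandas as pd

s = pd.Series(["$104,756", "$95,783"])

# regex=False treats "$" and "," as literal characters, not regex syntax
cleaned = s.str.replace(",", "", regex=False).str.replace("$", "", regex=False).astype("float")
```
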
# We will add another column for act average composite score / dollar expenditure per student
school_district_merge_full['act_average_composite_score_per_dollar_expenditure'] = school_district_merge_full['act_average_composite_score'] / school_district_merge_full['expenditure_per_student']
# We will also convert urban_locale to ordered categorical data type (ordered by size)
# We are only doing this after merging as merging causes the categorical dtype to revert to object type
from pandas.api.types import CategoricalDtype
# Base on NCES locale classification : https://nces.ed.gov/programs/edge/docs/locale_classifications.pdf
ordered_categories = np.sort(school_district_full_info_df["urban_locale"].unique())
cat_type = CategoricalDtype(categories=ordered_categories, ordered=True)
school_district_merge_full["urban_locale"] = school_district_merge_full["urban_locale"].astype(cat_type)
school_district_merge_full["urban_locale"].dtype
CategoricalDtype(categories=['11 - City, Large', '12 - City, Midsize', '13 - City, Small',
'21 - Suburban, Large', '22 - Suburban, Midsize',
'23 - Suburban, Small', '31 - Town, Fringe',
'32 - Town, Distant', '33 - Town, Remote',
'41 - Rural, Fringe', '42 - Rural, Distant',
'43 - Rural, Remote'],
ordered=True)
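An ordered categorical lets comparisons, `min`/`max`, and sorting follow the category order rather than plain string order, which is useful when ranking or plotting districts from most to least urban. A brief illustration with a few of the locale codes above:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

locales = ["11 - City, Large", "21 - Suburban, Large", "41 - Rural, Fringe"]
cat_type = CategoricalDtype(categories=locales, ordered=True)

s = pd.Series(["41 - Rural, Fringe", "11 - City, Large"], dtype=cat_type)

# Ordered categoricals support min/max and order-aware comparisons
assert s.min() == "11 - City, Large"
assert (s > "21 - Suburban, Large").tolist() == [True, False]
```
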
# Subset: rows with either ACT or SAT data
school_district_merge_subset = school_district_merge_full[(~school_district_merge_full["act_enroll"].isnull()) | (~school_district_merge_full["sat_enroll"].isnull())].drop("Unnamed: 0", axis = 1)
print(school_district_merge_subset.shape)
school_district_merge_subset.head()
(475, 63)
| NCES_ID | CDCode | county_name | district_name | district_type | urban_locale | total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | majority_race | white_majority | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | act_average_composite_score_per_dollar_expenditure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 601770 | 161119 | Alameda | Alameda Unified | Unified | 21 - Suburban, Large | 11260 | 0.166696 | 0.833304 | 0.006 | 0.0 | 0.052 | 0.026 | 78522.0 | 104756.0 | 30418.0 | 0.43 | 0.07 | 0.13 | 0.30 | 0.00 | 0.00 | 0.01 | 0.06 | 0.066 | 0.313 | 0.622 | 0.904 | 0.529 | 0.473 | 0.719 | 0.051 | 0.982 | 0.098 | 0.71 | 0.05 | 0.19 | 0.05 | 0.136 | 0.612 | 14325.0 | white | yes | 919.0 | 155.0 | 0.168662 | 27.0 | 26.0 | 26.0 | 25.0 | 134.0 | 0.8645 | 26.00 | 1790.0 | 812.0 | 0.453631 | 677.0 | 0.833744 | 598.0 | 0.736453 | 570.0 | 0.701970 | 0.001815 |
| 1 | 601860 | 161127 | Alameda | Albany City Unified | Unified | 21 - Suburban, Large | 3682 | 0.000000 | 1.000000 | 0.003 | 0.0 | 0.017 | 0.019 | 19862.0 | 95783.0 | 7467.0 | 0.44 | 0.03 | 0.15 | 0.29 | 0.01 | 0.00 | 0.02 | 0.06 | 0.130 | 0.233 | 0.636 | 0.838 | 0.526 | 0.473 | 0.662 | 0.019 | 0.994 | 0.065 | 0.75 | 0.05 | 0.16 | 0.04 | 0.155 | 0.792 | 21087.0 | white | yes | 307.0 | 58.0 | 0.188925 | 28.0 | 27.0 | 27.0 | 26.0 | 51.0 | 0.8793 | 27.00 | 569.0 | 261.0 | 0.458699 | 254.0 | 0.973180 | 238.0 | 0.911877 | 237.0 | 0.908046 | 0.001280 |
| 2 | 604740 | 161143 | Alameda | Berkeley Unified | Unified | 12 - City, Midsize | 10194 | 0.037571 | 0.962429 | 0.026 | 0.0 | 0.107 | 0.014 | 121444.0 | 85444.0 | 45337.0 | 0.53 | 0.08 | 0.11 | 0.21 | 0.00 | 0.00 | 0.00 | 0.06 | 0.053 | 0.168 | 0.779 | 0.912 | 0.458 | 0.542 | 0.809 | 0.031 | 0.987 | 0.043 | 0.68 | 0.05 | 0.21 | 0.06 | 0.118 | 0.743 | 17532.0 | white | yes | 820.0 | 227.0 | 0.276829 | 26.0 | 25.0 | 25.0 | 24.0 | 166.0 | 0.7313 | 25.00 | 1691.0 | 704.0 | 0.416322 | 621.0 | 0.882102 | 525.0 | 0.745739 | 518.0 | 0.735795 | 0.001426 |
| 3 | 607800 | 161150 | Alameda | Castro Valley Unified | Unified | 21 - Suburban, Large | 9269 | 0.000000 | 1.000000 | 0.007 | 0.0 | 0.030 | 0.022 | 50946.0 | 114277.0 | 17666.0 | 0.47 | 0.05 | 0.13 | 0.30 | 0.01 | 0.00 | 0.00 | 0.05 | 0.053 | 0.392 | 0.555 | 0.920 | 0.840 | 0.161 | 0.718 | 0.040 | 0.982 | 0.079 | 0.75 | 0.05 | 0.16 | 0.05 | 0.176 | 0.573 | 16312.0 | white | yes | 737.0 | 106.0 | 0.143826 | 26.0 | 25.0 | 25.0 | 25.0 | 82.0 | 0.7736 | 25.25 | 1496.0 | 661.0 | 0.441845 | 606.0 | 0.916793 | 545.0 | 0.824508 | 527.0 | 0.797277 | 0.001548 |
| 4 | 612630 | 161168 | Alameda | Emery Unified | Unified | 21 - Suburban, Large | 743 | 0.000000 | 1.000000 | 0.009 | 0.0 | 0.000 | 0.036 | 11899.0 | 102725.0 | 6568.0 | 0.40 | 0.15 | 0.10 | 0.29 | 0.00 | 0.01 | 0.00 | 0.06 | 0.332 | 0.498 | 0.169 | 0.912 | 0.130 | 0.870 | 0.675 | 0.000 | 1.000 | 0.224 | 0.47 | 0.03 | 0.47 | 0.03 | 0.176 | 0.565 | 23930.0 | white | yes | 40.0 | 0.0 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 99.0 | 66.0 | 0.666667 | 25.0 | 0.471698 | 16.0 | 0.301887 | 14.0 | 0.264151 | NaN |
After merging the dataframes with the test results and the dataframes with the demographic information, we have extracted 475 school districts with 63 columns each.
Although not all rows have complete information, we will keep every row for the analysis, since even partial data can be useful for our exploratory data analysis.
school_district_merge_subset.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 475 entries, 0 to 943 Data columns (total 63 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NCES_ID 475 non-null int64 1 CDCode 475 non-null int64 2 county_name 475 non-null object 3 district_name 475 non-null object 4 district_type 475 non-null object 5 urban_locale 475 non-null category 6 total_enrolment 475 non-null int64 7 charter_school_percentage 475 non-null float64 8 non_charter_school_percentage 475 non-null float64 9 homeless_student_percentage 475 non-null float64 10 migrant_student_percentage 475 non-null float64 11 dropout_percentage 462 non-null float64 12 suspension_percentage 475 non-null float64 13 total_population 475 non-null float64 14 median_household_income 475 non-null float64 15 total_household 475 non-null float64 16 white 475 non-null float64 17 black 475 non-null float64 18 hispanic_or_latino 475 non-null float64 19 asian 475 non-null float64 20 american_indian/alaskan_native 475 non-null float64 21 hawaiian_and_other_pacific_islander 475 non-null float64 22 some_other_race_alone 475 non-null float64 23 two_or_more_races 475 non-null float64 24 housing_structure_built_2000_and_after 475 non-null float64 25 housing_structure_built_1970-1999 475 non-null float64 26 housing_structure_built_before_1970 475 non-null float64 27 household_with_broadband_internet 475 non-null float64 28 housing_structure_type_house 475 non-null float64 29 housing_structure_type_apartment 475 non-null float64 30 speak_english_only_children 432 non-null float64 31 under18_with_disability 432 non-null float64 32 under18_with_health_insurance 432 non-null float64 33 family_income_below_poverty 432 non-null float64 34 married_couple_household 432 non-null float64 35 cohabitating_couple_household 432 non-null float64 36 female_householder_household 432 non-null float64 37 male_householder_household 432 non-null float64 38 parents_not_in_labor_force 416 non-null float64 39 bachelors_or_higher 
416 non-null float64 40 expenditure_per_student 459 non-null float64 41 majority_race 475 non-null object 42 white_majority 475 non-null object 43 act_enroll 460 non-null float64 44 act_num_test_taker 460 non-null float64 45 act_participation_rate 460 non-null float64 46 act_average_reading_score 339 non-null float64 47 act_average_english_score 339 non-null float64 48 act_average_math_score 339 non-null float64 49 act_average_science_score 339 non-null float64 50 act_num_above_average_score 339 non-null float64 51 act_percentage_above_average_score 339 non-null float64 52 act_average_composite_score 339 non-null float64 53 sat_enroll 475 non-null float64 54 sat_num_test_taker 475 non-null float64 55 sat_participation_rate 475 non-null float64 56 sat_num_erw_benchmark 394 non-null float64 57 sat_percentage_erw_benchmark 394 non-null float64 58 sat_num_math_benchmark 394 non-null float64 59 sat_percentage_math_benchmark 394 non-null float64 60 sat_num_both_benchmark 394 non-null float64 61 sat_percentage_both_benchmark 394 non-null float64 62 act_average_composite_score_per_dollar_expenditure 333 non-null float64 dtypes: category(1), float64(54), int64(3), object(5) memory usage: 234.7+ KB
# Export df to csv
#school_district_merge_subset.to_csv("../data/merged_df.csv")
Now that we've fixed our data, and given it appropriate names, let's create a data dictionary.
A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.
Example of a Fictional Data Dictionary Entry:
| Feature | Type | Dataset | Description |
|---|---|---|---|
| county_pop | integer | 2010 census | The population of the county (units in thousands, where 2.5 represents 2500 people). |
| per_poverty | float | 2010 census | The percent of the county over the age of 18 living below 200% of the official US poverty rate (units: percent to two decimal places, where 98.10 means 98.1%). |
Here's a quick link to a short guide for formatting markdown in Jupyter notebooks.
Provided is the skeleton for formatting a markdown table, with column headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. This would be a great thing to copy and paste into your custom README for this project.
Note: if you are unsure of what a feature is, check the source of the data! This can be found in the README.
| Feature | Type | Dataset | Description |
|---|---|---|---|
| NCES_ID | int64 | california_school_district_info.csv | School district identifier for National Center for Education Statistics (NCES) |
| CDCode | int64 | california_school_district_info.csv | Official school district Identifier |
| county_name | object | california_school_district_info.csv | County name |
| district_name | object | california_school_district_info.csv | School district name |
| district_type | object | california_school_district_info.csv | Type of school district |
| urban_locale | object | california_school_district_info.csv | Urban locale of school district (city, suburb, fringe, rural, etc) |
| total_enrolment | int64 | california_school_district_info.csv | Total student enrolment |
| charter_school_percentage | float64 | california_school_district_info.csv | Percentage of students in charter school |
| non_charter_school_percentage | float64 | california_school_district_info.csv | Percentage of students in non-charter school (Public) |
| homeless_student_percentage | float64 | california_school_district_info.csv | Percentage of students who are homeless |
| migrant_student_percentage | float64 | california_school_district_info.csv | Percentage of students who are migrants |
| dropout_percentage | float64 | california_school_district_info.csv | Percentage of students who dropped out |
| suspension_percentage | float64 | california_school_district_info.csv | Percentage of students suspended |
| total_population | float64 | california_school_district_NCES_info.csv | Total population of district |
| median_household_income | float64 | california_school_district_NCES_info.csv | Median household income of district |
| total_household | float64 | california_school_district_NCES_info.csv | Total household number of district |
| white | float64 | california_school_district_NCES_info.csv | Percentage of district population who are white |
| black | float64 | california_school_district_NCES_info.csv | Percentage of district population who are black |
| hispanic_or_latino | float64 | california_school_district_NCES_info.csv | Percentage of district population who are hispanic_or_latino |
| asian | float64 | california_school_district_NCES_info.csv | Percentage of district population who are asian |
| american_indian/alaskan_native | float64 | california_school_district_NCES_info.csv | Percentage of district population who are american indian/alaskan native |
| hawaiian_and_other_pacific_islander | float64 | california_school_district_NCES_info.csv | Percentage of district population who are hawaiian and other pacific islanders |
| some_other_race_alone | float64 | california_school_district_NCES_info.csv | Percentage of district population who are from other races |
| two_or_more_races | float64 | california_school_district_NCES_info.csv | Percentage of district population who identify as two or more races |
| housing_structure_built_2000_and_after | float64 | california_school_district_NCES_info.csv | Percentage of houses built 2000 and after |
| housing_structure_built_1970-1999 | float64 | california_school_district_NCES_info.csv | Percentage of houses built from 1970 to 1999 |
| housing_structure_built_before_1970 | float64 | california_school_district_NCES_info.csv | Percentage of houses built before 1970 |
| household_with_broadband_internet | float64 | california_school_district_NCES_info.csv | Percentage of households with internet broadband |
| housing_structure_type_house | float64 | california_school_district_NCES_info.csv | Percentage of district population who live in houses |
| housing_structure_type_apartment | float64 | california_school_district_NCES_info.csv | Percentage of district population who live in apartments |
| speak_english_only_children | float64 | california_school_district_NCES_info.csv | Percentage of district children who speak English only |
| under18_with_disability | float64 | california_school_district_NCES_info.csv | Percentage of students with disability |
| under18_with_health_insurance | float64 | california_school_district_NCES_info.csv | Percentage of students with health insurance coverage |
| family_income_below_poverty | float64 | california_school_district_NCES_info.csv | Percentage of students with family income below poverty level |
| married_couple_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from married couple households |
| cohabitating_couple_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from cohabitating couple households |
| female_householder_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from households with female householder only |
| male_householder_household | float64 | california_school_district_NCES_info.csv | Percentage of students who are from households with male householder only |
| parents_not_in_labor_force | float64 | california_school_district_NCES_info.csv | Percentage of students with parents not in labor force |
| bachelors_or_higher | float64 | california_school_district_NCES_info.csv | Percentage of students with parents who possess at least a bachelor's degree |
| expenditure_per_student | float64 | california_school_district_NCES_info.csv | Expenditure per student of the school district |
| majority_race | object | california_school_district_NCES_info.csv | The majority race of the district |
| white_majority | object | california_school_district_NCES_info.csv | Whether the majority race of the district is white |
| act_enroll | float64 | act_2019_ca.csv | Enrollment of Grade 12 |
| act_num_test_taker | float64 | act_2019_ca.csv | Number of Test Takers for ACT |
| act_participation_rate | float64 | act_2019_ca.csv | Participation Rate of Grade 12 in ACT |
| act_average_reading_score | float64 | act_2019_ca.csv | Average ACT Reading Score |
| act_average_english_score | float64 | act_2019_ca.csv | Average ACT English Score |
| act_average_math_score | float64 | act_2019_ca.csv | Average ACT Math Score |
| act_average_science_score | float64 | act_2019_ca.csv | Average ACT Science Score |
| act_num_above_average_score | float64 | act_2019_ca.csv | Number of Test Takers Whose ACT Composite Scores Are Greater or Equal to 21. |
| act_percentage_above_average_score | float64 | act_2019_ca.csv | Percent of Test Takers Whose ACT Composite Scores Are Greater or Equal to 21 |
| sat_enroll | float64 | sat_2019_ca.csv | Enrollment of Grade 12 and Grade 11 |
| sat_num_test_taker | float64 | sat_2019_ca.csv | Number of Test Takers for SAT |
| sat_participation_rate | float64 | sat_2019_ca.csv | Participation Rate of Grade 12 and 11 in SAT |
| sat_num_erw_benchmark | float64 | sat_2019_ca.csv | The number meeting the Evidence-Based Reading & Writing (ERW) benchmark established by the College Board based on the New 2016 SAT test format |
| sat_percentage_erw_benchmark | float64 | sat_2019_ca.csv | The percent of students who met or exceeded the benchmark for Evidence-Based Reading & Writing (ERW) test |
| sat_num_math_benchmark | float64 | sat_2019_ca.csv | The number of students who met or exceeded the benchmark for the New SAT Math test format |
| sat_percentage_math_benchmark | float64 | sat_2019_ca.csv | The percent of students who met or exceeded the benchmark for SAT Math test |
| sat_num_both_benchmark | float64 | sat_2019_ca.csv | The total number of students who met the benchmark of both Evidence-Based Reading & Writing (ERW) and Math |
| sat_percentage_both_benchmark | float64 | sat_2019_ca.csv | The percent of students who met the benchmark of both Evidence-Based Reading & Writing (ERW) and Math |
# Summary of main statistics for each variable
school_district_merge_subset.drop(["NCES_ID", "CDCode"], axis = 1).describe()
| total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | act_average_composite_score_per_dollar_expenditure | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 462.000000 | 475.000000 | 4.750000e+02 | 475.000000 | 4.750000e+02 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 475.000000 | 432.000000 | 432.000000 | 432.000000 | 432.000000 | 432.000000 | 432.000000 | 432.000000 | 432.000000 | 416.000000 | 416.000000 | 459.000000 | 460.000000 | 460.000000 | 460.000000 | 339.000000 | 339.000000 | 339.000000 | 339.000000 | 339.000000 | 339.000000 | 339.000000 | 475.000000 | 475.000000 | 475.000000 | 394.000000 | 394.000000 | 394.000000 | 394.000000 | 394.000000 | 394.000000 | 333.000000 |
| mean | 10642.488421 | 0.121028 | 0.878972 | 0.031722 | 0.014307 | 0.088385 | 0.042617 | 8.430221e+04 | 73029.294737 | 2.794049e+04 | 0.472463 | 0.034674 | 0.360147 | 0.087768 | 0.012000 | 0.002400 | 0.001263 | 0.029432 | 0.166619 | 0.461312 | 0.372063 | 0.831636 | 0.743693 | 0.256257 | 0.612208 | 0.046002 | 0.966125 | 0.157308 | 0.673032 | 0.078889 | 0.188727 | 0.061898 | 0.202906 | 0.276690 | 14831.448802 | 1027.202174 | 177.552174 | 0.161152 | 22.604720 | 21.702065 | 21.808260 | 21.808260 | 131.908555 | 0.559633 | 21.980826 | 1966.444211 | 785.406316 | 0.318992 | 652.898477 | 0.754140 | 476.781726 | 0.547429 | 453.928934 | 0.524212 | 0.001573 |
| std | 30288.521502 | 0.243808 | 0.243808 | 0.047381 | 0.030067 | 0.123798 | 0.030278 | 2.409532e+05 | 28810.334871 | 8.234240e+04 | 0.242464 | 0.046179 | 0.236082 | 0.120654 | 0.039367 | 0.005822 | 0.003799 | 0.017317 | 0.111625 | 0.136602 | 0.185034 | 0.093556 | 0.127698 | 0.127669 | 0.215151 | 0.029948 | 0.029698 | 0.100918 | 0.103515 | 0.049086 | 0.069386 | 0.039570 | 0.053031 | 0.204948 | 5072.354895 | 2320.361334 | 501.243186 | 0.164227 | 3.300992 | 3.687059 | 3.155036 | 2.936476 | 228.090713 | 0.234052 | 3.225958 | 4555.676636 | 2653.105664 | 0.172016 | 1520.318818 | 0.158037 | 956.769582 | 0.213065 | 895.982009 | 0.214770 | 0.000615 |
| min | 28.000000 | 0.000000 | 0.004554 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.100000e+01 | 21648.000000 | 1.400000e+01 | 0.010000 | 0.000000 | 0.030000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.061000 | 0.016000 | 0.367000 | 0.130000 | 0.003000 | 0.017000 | 0.000000 | 0.705000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.068000 | 0.000000 | 490.000000 | 13.000000 | 0.000000 | 0.000000 | 15.000000 | 14.000000 | 15.000000 | 15.000000 | 0.000000 | 0.000000 | 14.750000 | 3.000000 | 0.000000 | 0.000000 | 8.000000 | 0.221374 | 1.000000 | 0.007634 | 1.000000 | 0.007634 | 0.000623 |
| 25% | 1552.000000 | 0.000000 | 0.891681 | 0.005000 | 0.000000 | 0.029000 | 0.021000 | 9.817000e+03 | 51863.000000 | 3.170000e+03 | 0.260000 | 0.010000 | 0.160000 | 0.010000 | 0.000000 | 0.000000 | 0.000000 | 0.020000 | 0.088500 | 0.372000 | 0.231500 | 0.787500 | 0.674000 | 0.165500 | 0.456500 | 0.029000 | 0.955000 | 0.081000 | 0.620000 | 0.050000 | 0.140000 | 0.040000 | 0.168000 | 0.116000 | 12568.000000 | 146.500000 | 12.000000 | 0.065313 | 20.000000 | 19.000000 | 19.000000 | 19.000000 | 19.000000 | 0.356400 | 19.250000 | 259.500000 | 52.500000 | 0.218001 | 81.500000 | 0.647120 | 53.000000 | 0.368731 | 50.000000 | 0.343885 | 0.001250 |
| 50% | 4466.000000 | 0.000000 | 1.000000 | 0.013000 | 0.000000 | 0.054500 | 0.038000 | 3.414700e+04 | 67258.000000 | 1.138400e+04 | 0.490000 | 0.020000 | 0.300000 | 0.040000 | 0.000000 | 0.000000 | 0.000000 | 0.030000 | 0.144000 | 0.466000 | 0.346000 | 0.854000 | 0.752000 | 0.248000 | 0.627500 | 0.041000 | 0.972000 | 0.137500 | 0.670000 | 0.070000 | 0.190000 | 0.060000 | 0.194000 | 0.222000 | 14067.000000 | 411.000000 | 54.500000 | 0.124155 | 23.000000 | 22.000000 | 22.000000 | 22.000000 | 56.000000 | 0.573800 | 22.000000 | 769.000000 | 214.000000 | 0.300574 | 269.000000 | 0.789684 | 186.500000 | 0.571429 | 175.500000 | 0.539130 | 0.001507 |
| 75% | 11831.000000 | 0.108319 | 1.000000 | 0.039500 | 0.015000 | 0.095500 | 0.057000 | 9.298150e+04 | 88252.500000 | 3.060950e+04 | 0.680000 | 0.040000 | 0.530000 | 0.110000 | 0.010000 | 0.000000 | 0.000000 | 0.040000 | 0.215000 | 0.550000 | 0.489500 | 0.899000 | 0.835000 | 0.326000 | 0.791000 | 0.056000 | 0.983000 | 0.213500 | 0.740000 | 0.100000 | 0.220000 | 0.080000 | 0.236500 | 0.382000 | 16164.500000 | 1245.750000 | 192.250000 | 0.208853 | 25.000000 | 24.000000 | 24.000000 | 24.000000 | 149.500000 | 0.759550 | 24.250000 | 2383.500000 | 803.500000 | 0.416368 | 745.000000 | 0.887244 | 540.500000 | 0.708018 | 515.250000 | 0.684951 | 0.001784 |
| max | 607723.000000 | 0.995446 | 1.000000 | 0.368000 | 0.205000 | 0.938000 | 0.259000 | 4.723575e+06 | 224205.000000 | 1.608570e+06 | 0.890000 | 0.430000 | 0.980000 | 0.660000 | 0.530000 | 0.060000 | 0.030000 | 0.110000 | 0.886000 | 0.907000 | 0.926000 | 1.000000 | 0.997000 | 0.870000 | 1.000000 | 0.267000 | 1.000000 | 0.530000 | 1.000000 | 0.400000 | 0.520000 | 0.400000 | 0.425000 | 0.905000 | 47881.000000 | 41873.000000 | 9336.000000 | 1.550000 | 30.000000 | 31.000000 | 31.000000 | 29.000000 | 2774.000000 | 1.000000 | 30.250000 | 84155.000000 | 53343.000000 | 1.083333 | 26292.000000 | 1.000000 | 14927.000000 | 0.959770 | 13773.000000 | 0.948276 | 0.008454 |
Just looking at the ACT and SAT data, we see that:
# Using the function defined earlier, get standard deviation for numeric columns
sd = {school_district_merge_subset[col_name].name : standard_deviation(school_district_merge_subset[~school_district_merge_subset[col_name].isnull()][col_name])
for col_name in school_district_merge_subset.columns
if school_district_merge_subset[col_name].dtype == 'float64' or school_district_merge_subset[col_name].dtype == 'int64'}
sd
{'NCES_ID': 15364.673625764257,
'CDCode': 1552763.0351146937,
'total_enrolment': 30256.622049628702,
'charter_school_percentage': 0.24355160909302706,
'non_charter_school_percentage': 0.24355160909302712,
'homeless_student_percentage': 0.04733071418293694,
'migrant_student_percentage': 0.030035457212354275,
'dropout_percentage': 0.12366432215319996,
'suspension_percentage': 0.030246157198029942,
'total_population': 240699.39097648003,
'median_household_income': 28779.992224450754,
'total_household': 82255.67693456516,
'white': 0.24220834052857607,
'black': 0.04613080763158581,
'hispanic_or_latino': 0.2358335259600702,
'asian': 0.12052721488977464,
'american_indian/alaskan_native': 0.039325898791883813,
'hawaiian_and_other_pacific_islander': 0.005815406036419224,
'some_other_race_alone': 0.003795317132067337,
'two_or_more_races': 0.01729901282950408,
'housing_structure_built_2000_and_after': 0.11150748881599773,
'housing_structure_built_1970-1999': 0.13645812842515614,
'housing_structure_built_before_1970': 0.1848387297377915,
'household_with_broadband_internet': 0.09345702070531445,
'housing_structure_type_house': 0.12756388801166946,
'housing_structure_type_apartment': 0.12753415777960833,
'speak_english_only_children': 0.21490172845633077,
'under18_with_disability': 0.02991318448030473,
'under18_with_health_insurance': 0.029664062385691917,
'family_income_below_poverty': 0.1008011433150082,
'married_couple_household': 0.1033952272778294,
'cohabitating_couple_household': 0.049029154843739264,
'female_householder_household': 0.06930541252242843,
'male_householder_household': 0.03952388865418509,
'parents_not_in_labor_force': 0.052967319018564864,
'bachelors_or_higher': 0.20470121869820232,
'expenditure_per_student': 5066.826441311815,
'act_enroll': 2317.8378296977285,
'act_num_test_taker': 500.6980599901328,
'act_participation_rate': 0.1640482368897844,
'act_average_reading_score': 3.2961193095197157,
'act_average_english_score': 3.6816168351503094,
'act_average_math_score': 3.150378950449397,
'act_average_science_score': 2.9321415216725204,
'act_num_above_average_score': 227.7540473669151,
'act_percentage_above_average_score': 0.23370616824910612,
'act_average_composite_score': 3.221196560657854,
'sat_enroll': 4550.878660162136,
'sat_num_test_taker': 2650.3114496546696,
'sat_participation_rate': 0.17183455683977428,
'sat_num_erw_benchmark': 1518.388253435289,
'sat_percentage_erw_benchmark': 0.1578366890178709,
'sat_num_math_benchmark': 955.5546363323822,
'sat_percentage_math_benchmark': 0.2127947586462225,
'sat_num_both_benchmark': 894.8442538254612,
'sat_percentage_both_benchmark': 0.21449749463849377,
'act_average_composite_score_per_dollar_expenditure': 0.0006136393122987683}
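As an aside, the dictionary comprehension above can be replaced by pandas' built-in column-wise standard deviation. The slight mismatch between these values and the std row from `.describe()` suggests the custom `standard_deviation` function computes the population standard deviation (`ddof=0`), while pandas defaults to the sample standard deviation (`ddof=1`). A minimal sketch on a toy frame (this is an illustration, not the project's data):

```python
import pandas as pd

# Toy frame standing in for school_district_merge_subset
df = pd.DataFrame({
    "total_enrolment": [28.0, 4466.0, 11831.0, 607723.0],
    "district_name": ["a", "b", "c", "d"],  # non-numeric, excluded from std
})

# Built-in equivalent of the dictionary comprehension above;
# ddof=0 gives the population standard deviation, ddof=1 the sample one
pop_sd = df.select_dtypes(include="number").std(ddof=0)
sample_sd = df.select_dtypes(include="number").std(ddof=1)
print(pop_sd["total_enrolment"], sample_sd["total_enrolment"])
```

`select_dtypes(include="number")` also saves the manual dtype check used in the comprehension.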
We will use the percentage of students who meet the benchmark for both SAT and ACT.
# ACT Top 5
school_district_merge_subset.loc[~school_district_merge_subset["act_percentage_above_average_score"].isnull()].sort_values(
by = "act_percentage_above_average_score", ascending = False)[["district_name", "act_percentage_above_average_score"]].head()
| district_name | act_percentage_above_average_score | |
|---|---|---|
| 569 | Lakeside Union Elementary | 1.0000 |
| 677 | Los Gatos-Saratoga Union High | 0.9610 |
| 12 | Piedmont City Unified | 0.9483 |
| 671 | Fremont Union High | 0.9427 |
| 278 | La Canada Unified | 0.9420 |
# SAT Top 5
school_district_merge_subset.loc[~school_district_merge_subset["sat_percentage_both_benchmark"].isnull()].sort_values(
by = "sat_percentage_both_benchmark", ascending = False)[["district_name", "sat_percentage_both_benchmark"]].head()
| district_name | sat_percentage_both_benchmark | |
|---|---|---|
| 278 | La Canada Unified | 0.948276 |
| 300 | San Marino Unified | 0.943662 |
| 686 | Palo Alto Unified | 0.943284 |
| 677 | Los Gatos-Saratoga Union High | 0.941834 |
| 321 | Manhattan Beach Unified | 0.929577 |
# ACT Bottom 5
school_district_merge_subset.loc[~school_district_merge_subset["act_percentage_above_average_score"].isnull()].sort_values(
by = "act_percentage_above_average_score", ascending = False)[["district_name", "act_percentage_above_average_score"]].tail()
| district_name | act_percentage_above_average_score | |
|---|---|---|
| 107 | Washington Unified | 0.0826 |
| 315 | Compton Unified | 0.0717 |
| 387 | Tulelake Basin Joint Unified | 0.0667 |
| 228 | Reef-Sunset Unified | 0.0357 |
| 103 | Golden Plains Unified | 0.0000 |
# SAT Bottom 5
school_district_merge_subset.loc[~school_district_merge_subset["sat_percentage_both_benchmark"].isnull()].sort_values(
by = "sat_percentage_both_benchmark", ascending = False)[["district_name", "sat_percentage_both_benchmark"]].tail()
| district_name | sat_percentage_both_benchmark | |
|---|---|---|
| 414 | Calistoga Joint Unified | 0.127660 |
| 99 | Firebaugh-Las Deltas Unified | 0.109489 |
| 899 | Farmersville Unified | 0.098361 |
| 160 | San Pasqual Valley Unified | 0.083333 |
| 103 | Golden Plains Unified | 0.007634 |
We see recurring names in the top 5 and bottom 5 of both tests: these are school districts which consistently perform well, or consistently perform poorly.
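One quick way to confirm this overlap programmatically is to intersect the two sets of top-ranked district names. A sketch with hypothetical scores (not the real rankings), using `Series.nlargest`:

```python
import pandas as pd

# Hypothetical scores standing in for the real district rankings
act = pd.Series({"Piedmont City Unified": 0.948,
                 "La Canada Unified": 0.942,
                 "Golden Plains Unified": 0.000})
sat = pd.Series({"La Canada Unified": 0.948,
                 "San Marino Unified": 0.944,
                 "Golden Plains Unified": 0.008})

# Districts appearing near the top of both tests
consistent = set(act.nlargest(2).index) & set(sat.nlargest(2).index)
print(consistent)
```

The same set intersection applied to the bottom of each ranking would surface the consistently underperforming districts.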
plt.figure(figsize = (16,12))
# composite score
# top-right: ACT composite score
ax0 = plt.subplot(322)
sns.histplot(school_district_merge_subset["act_average_composite_score"] , bins = 20, stat= "probability", kde = True, color = "#31838C")
ax0.set_xlabel("ACT average composite score")
ax0.set_ylabel("Percentage")
# Participation rate
# second-left: ACT participation rate
ax1 = plt.subplot(323)
sns.histplot(school_district_merge_subset[school_district_merge_subset["act_participation_rate"]<=1]["act_participation_rate"] , bins = 20, stat= "probability", kde = True, color = "firebrick")
ax1.set_xlabel("ACT participation rate")
ax1.set_ylabel("Percentage")
# bottom-left: SAT participation rate
# share the x-axis with the ACT panel via subplot, not cbar_kws
ax2 = plt.subplot(325, sharex = ax1)
sns.histplot(school_district_merge_subset[school_district_merge_subset["sat_participation_rate"]<=1]["sat_participation_rate"] , bins = 20, stat= "probability", kde = True, color = "firebrick")
ax2.set_xlabel("SAT participation rate")
ax2.set_ylabel("Percentage")
ax2.xaxis.tick_top()
ax2.xaxis.set_label_position('top')
ax2.invert_yaxis()
# Benchmark percentage
# second-right: ACT benchmark percentage
ax3 = plt.subplot(324)
sns.histplot(school_district_merge_subset[school_district_merge_subset["act_percentage_above_average_score"]<=1]["act_percentage_above_average_score"] , bins = 20, stat= "probability", kde = True, color = "#31838C")
ax3.set_xlabel("ACT benchmark above average")
ax3.set_ylabel("Percentage")
# bottom-right: SAT benchmark percentage
# share the x-axis with the ACT panel via subplot, not cbar_kws
ax4 = plt.subplot(326, sharex = ax3)
sns.histplot(school_district_merge_subset[school_district_merge_subset["sat_percentage_both_benchmark"]<=1]["sat_percentage_both_benchmark"] , bins = 20, stat= "probability", kde = True, color = "#31838C")
ax4.set_xlabel("SAT benchmark above average")
ax4.set_ylabel("Percentage")
ax4.xaxis.tick_top()
ax4.xaxis.set_label_position('top')
ax4.invert_yaxis()
plt.tight_layout();
In terms of participation rate, the SAT seems to be the more popular choice among students in California, with a median of about 0.32 compared to about 0.16 for the ACT. Most districts' ACT participation rates fall below 0.3, while a sizeable number of districts have SAT participation rates above 0.3. This suggests the SAT is a more representative indicator for California students, as it samples a larger share of the cohort.
In terms of the percentage scoring above the benchmark, the two tests have similar distributions, with a rather large spread and a noticeable trough (just above 0.4) between two crests.
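The `stat="probability"` option used in the histograms above normalizes bar heights so they sum to one, which is what lets panels with different district counts be compared. The same normalization can be done by hand with numpy (a sketch on synthetic data, not the real column):

```python
import numpy as np

rng = np.random.default_rng(0)
rates = rng.uniform(0, 1, size=500)  # stand-in for a participation-rate column

counts, edges = np.histogram(rates, bins=20)
proportions = counts / counts.sum()  # the bar heights stat="probability" draws
print(proportions.sum())
```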
plt.scatter(school_district_merge_subset["act_percentage_above_average_score"], school_district_merge_subset["sat_percentage_both_benchmark"], color = "#2B5A79")
plt.xlabel("ACT percentage above average score")
plt.ylabel("SAT percentage both benchmark");
plt.title("Comparing ACT and SAT benchmark percentages");
plt.scatter(school_district_merge_subset["act_average_composite_score"], school_district_merge_subset["sat_percentage_both_benchmark"], color = "#2B5A79")
plt.xlabel("ACT average composite score")
plt.ylabel("SAT percentage both benchmark");
plt.title("Comparing ACT composite score with SAT benchmark percentages");
school_district_merge_subset[["act_percentage_above_average_score", "sat_percentage_both_benchmark"]].corr()
| act_percentage_above_average_score | sat_percentage_both_benchmark | |
|---|---|---|
| act_percentage_above_average_score | 1.000000 | 0.869943 |
| sat_percentage_both_benchmark | 0.869943 | 1.000000 |
Comparing the benchmark percentages for the ACT and SAT, we see a high level of correlation between the two tests. As such, it should not matter much which indicator we use.
The SAT has the advantage that more students in California take it.
The ACT has the composite score, which is the average score across subjects for a district (whereas the SAT benchmark measures the percentage of test takers scoring above a certain level); this may offer a higher correlation with the independent variables.
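The `.corr()` calls used throughout report the Pearson coefficient by default. As a sanity check, it can be reproduced from its definition (covariance over the product of standard deviations) on toy data:

```python
import numpy as np
import pandas as pd

# Two toy columns with an obvious positive relationship
df = pd.DataFrame({"act": [19.0, 21.0, 24.0, 27.0, 30.0],
                   "sat": [0.35, 0.50, 0.60, 0.80, 0.95]})

x, y = df["act"], df["sat"]
# Pearson r, written out from the definition
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

r_pandas = df.corr().iloc[0, 1]
print(round(r_manual, 6), round(r_pandas, 6))
```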
plt.scatter(school_district_merge_subset["sat_participation_rate"], school_district_merge_subset["sat_percentage_both_benchmark"], color = "#31838C", alpha = 0.8)
plt.xlabel("SAT participation rate")
plt.ylabel("SAT percentage both benchmark");
plt.title("Comparing SAT participation rate and benchmark percentages");
school_district_merge_subset[["sat_participation_rate", "sat_percentage_both_benchmark"]].corr()
| sat_participation_rate | sat_percentage_both_benchmark | |
|---|---|---|
| sat_participation_rate | 1.000000 | -0.058162 |
| sat_percentage_both_benchmark | -0.058162 | 1.000000 |
There does not seem to be a significant correlation between participation rate and benchmark.
# Get correlation between different variables
# As we clearly know that the aggregated scores and benchmarks are derived from individual subjects, we will not visualize its constituents
corr = school_district_merge_subset.drop(["NCES_ID", "CDCode", "act_enroll", "act_num_test_taker",
"act_average_reading_score", "act_average_english_score",
"act_average_math_score", "act_average_science_score",
"act_num_above_average_score", "act_percentage_above_average_score",
"sat_num_erw_benchmark", "sat_percentage_erw_benchmark",
"sat_num_math_benchmark", "sat_percentage_math_benchmark", "sat_num_both_benchmark"], axis = 1).corr()
plt.figure(figsize = (16,16))
sns.heatmap(corr, square = True, annot = False, cmap = "RdBu")
We will extract the variables whose absolute correlation with the ACT average composite score is at least 0.5.
corr_subset = corr[(corr >= 0.5) | (corr <= -0.5)]["act_average_composite_score"].dropna(how = 'all')
plt.figure(figsize = (10,10))
sns.heatmap(corr.loc[corr_subset.index, corr_subset.index], square = True, annot = True, cmap = "RdBu")
corr_subset
median_household_income               0.741621
white                                 0.510368
hispanic_or_latino                   -0.734650
household_with_broadband_internet     0.670722
speak_english_only_children           0.515659
family_income_below_poverty          -0.690693
married_couple_household              0.592471
cohabitating_couple_household        -0.537570
bachelors_or_higher                   0.832104
act_average_composite_score           1.000000
sat_percentage_both_benchmark         0.876822
Name: act_average_composite_score, dtype: float64
The above independent variables each have at least a 0.5 absolute correlation with the ACT average composite score. Next we will examine these variables more closely.
Even before deciding how and where it is best to spend the education budget, it is wise to check whether money actually brings better results. Since both government-level education spending and parental investment can potentially drive results, we will analyse results with regard to both the expenditure per student and the median household income of each school district.
Expenditure per student refers to district level educational spending per student, and consists of the following areas:
Refer to Source for more details
This is money spent directly to support the educational system. One would expect that directly investing in the education system would bring about better results, but is that the case?
We will see if increasing spending is actually correlated with higher percentage of students scoring above benchmark, which would indicate a higher chance of success in college.
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = "expenditure_per_student", y = "sat_percentage_both_benchmark", color = "#2B5A79", s = 100, alpha = 0.7)
plt.xlabel("Expenditure per student")
plt.ylabel("SAT percentage above benchmark")
plt.title("Does increasing educational expenditure increase SAT result?")
school_district_merge_subset[["expenditure_per_student", "sat_percentage_both_benchmark"]].corr()
| expenditure_per_student | sat_percentage_both_benchmark | |
|---|---|---|
| expenditure_per_student | 1.000000 | 0.027872 |
| sat_percentage_both_benchmark | 0.027872 | 1.000000 |
The correlation of 0.028 shows essentially no linear relationship between expenditure per student and the SAT benchmark result.
However, we do see a concentration of expenditure between 10,000 USD and 20,000 USD per student, and within this range there is a rather large disparity in SAT results.
We will perform another analysis by grouping the expenditure into different bins.
# Create 3 "bins" according to expenditure per student
def expenditure_bins(row):
if row["expenditure_per_student"] < 10000:
row["expenditure_level"] = "less than 10000"
elif row["expenditure_per_student"] > 20000:
row["expenditure_level"] = "more than 20000"
else:
row["expenditure_level"] = "between 10000 to 20000"
return row
# Plot boxplot
plt.figure(figsize = (8,8))
sns.boxplot(data = school_district_merge_subset.apply(expenditure_bins, axis = 1), x = "expenditure_level", y = "sat_percentage_both_benchmark",
order = ["less than 10000", "between 10000 to 20000", "more than 20000"], palette = "crest")
plt.xticks(rotation = 45);
plt.xlabel("Expenditure per student")
plt.ylabel("SAT percentage above benchmark");
plt.title("Does increasing educational expenditure increase SAT result? \nGroup by 3 expenditure levels");
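As a side note, the row-wise `apply` above works but is slow on larger frames; `pd.cut` achieves the same three-way binning in one vectorized call. A sketch with toy expenditure values (note that `pd.cut` bins are right-closed by default, so it differs from the original function only for values exactly at 10,000 or 20,000):

```python
import pandas as pd

spend = pd.Series([8000, 12000, 15000, 25000, 47881])
levels = pd.cut(
    spend,
    bins=[-float("inf"), 10000, 20000, float("inf")],
    labels=["less than 10000", "between 10000 to 20000", "more than 20000"],
)
print(levels.tolist())
```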
By grouping the expenditure per student into 3 groups, we do see an observation:
But still, should we simply increase funding for schools without thinking about how they spend it?
Another angle from which we may approach the expenditure debate is that of charter vs non-charter schools.
For some basic background, charter and non-charter schools are both public schools in that they receive funding from the government. The two differ in that charter schools are run by independent groups, including non-profit groups, and do not need to adhere to most of the guidelines that govern non-charter public schools; essentially, they have more freedom in the way the school is run. However, they also tend to receive less funding from the government than non-charter schools (Source), in part because they are not funded for facilities. This may mean that facilities costs sometimes encroach on operating costs, hence less spending per student.
Let us first see if school districts with a higher percentage of charter schools indeed spend less on students.
# Corr coef
print("Correlation between charter school percentage and expenditure: ",
      school_district_merge_subset[['charter_school_percentage', 'expenditure_per_student']].corr().iloc[1,0])
# Plotting charter school percentage vs expenditure
plt.figure(figsize = (12,8))
sns.regplot(data = school_district_merge_subset, x = "charter_school_percentage", y = "expenditure_per_student", ci= None,
scatter_kws = {'s':100, 'alpha': 0.7, 'color': "#2B5A79"},
line_kws = {"linewidth": 3, 'color': 'firebrick', 'alpha': 0.7})
plt.xlabel("Charter school percentage of school district")
plt.ylabel("Expenditure per student")
plt.xlim(-0.02, 1)
plt.title("Do charter schools spend less on students?");
Correlation between charter school percentage and expenditure:  -0.4781707916995651
There is a -0.48 correlation, not a particularly strong one, but considering that we are looking at the correlation at the district level (remember that each district contains both charter and non-charter schools), we will accept this correlation coefficient as sufficient for now.
From the scatter plot, we see that districts with 0% charter schools can reach higher levels of expenditure per student, and overall there is a decreasing trend.
Now let us see the ACT performance for districts with different percentages of charter schools.
plt.figure(figsize = (12,8))
sns.regplot(data = school_district_merge_subset, x = "charter_school_percentage", y = "act_average_composite_score", ci= None,
scatter_kws = {'s':100, 'alpha': 0.7, 'color': "#2B5A79"},
line_kws = {"linewidth": 3, 'color': 'firebrick', 'alpha': 0.7})
plt.xlabel("Charter school percentage of school district")
plt.ylabel("ACT average composite score")
plt.title("Do districts with a higher percentage of charter schools perform worse in ACT?");
print("Correlation between charter school percentage and ACT average composite score: ",
school_district_merge_subset[['charter_school_percentage', 'act_average_composite_score']].corr().iloc[1,0])
Correlation between charter school percentage and ACT average composite score: -0.14897158256143714
There appears to be a decreasing trend, which may suggest that districts with a higher percentage of charter schools do in fact perform worse in the ACT than those that do not.
However, this correlation coefficient (-0.15) is much closer to 0 than the one with expenditure (-0.48), which suggests that the lower expenditure of charter-heavy districts does not translate proportionally into worse ACT performance.
To see if that is indeed the case, let us introduce another metric: ACT score per dollar spent per student. This focuses on how efficiently the money is used, judged by ACT achievement. If districts with a higher charter school percentage achieve a higher ACT score per dollar of expenditure, then teaching quality or other non-expenditure factors are behind charter schools' ACT performance, and they can be said to make more efficient use of expenditure.
This would also suggest that the effect of increasing expenditure may be limited, and it would probably not be wise to further increase expenditure without improving efficiency.
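The derived column used below is presumably the composite score divided by per-student expenditure; a minimal sketch of constructing such a column (the division itself is an assumption about how the metric was built):

```python
import pandas as pd

# Toy rows; the division below is an assumption about how the derived
# column act_average_composite_score_per_dollar_expenditure was built
df = pd.DataFrame({
    "act_average_composite_score": [22.0, 24.5],
    "expenditure_per_student": [14000.0, 16000.0],
})

# Score achieved per dollar spent per student: higher means the district
# converts spending into ACT performance more efficiently
df["act_average_composite_score_per_dollar_expenditure"] = (
    df["act_average_composite_score"] / df["expenditure_per_student"]
)
print(df["act_average_composite_score_per_dollar_expenditure"].tolist())
```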
# Corr coef
print("Correlation between charter school percentage and score per dollar spent: ",
school_district_merge_subset[['charter_school_percentage', 'act_average_composite_score_per_dollar_expenditure']].corr().iloc[1,0])
# Plotting charter school percentage vs expenditure per student
plt.figure(figsize = (12,8))
sns.regplot(x = school_district_merge_subset["charter_school_percentage"], y = school_district_merge_subset["act_average_composite_score_per_dollar_expenditure"], ci= None,
scatter_kws = {'s':100, 'alpha': 0.7, 'color': "#2B5A79"},
line_kws = {"linewidth": 3, 'color': 'firebrick', 'alpha': 0.7})
plt.xlabel("Charter school percentage of school district")
plt.ylabel("ACT score per dollar expenditure spent on student")
plt.title("Do charter schools perform better in ACT measured in terms of score per dollar spent?");
Correlation between charter school percentage and score per dollar spent: 0.559328300037938
We see that districts with more charter schools actually make use of their expenditure more efficiently, when judged in terms of ACT score per dollar spent per student. This shows that simply increasing funding to schools is not a sure way to improve test performance.
Another financial aspect which may influence test performance is the affluence of a student's family background. A wealthier family can afford to provide more financial support for a student's academic development.
Examining whether this is the case could guide decision makers in allocating resources better: perhaps more funds should go to supporting lower-income families, or perhaps a more long-term approach is to improve the overall financial well-being of the community.
From earlier analysis, median household income has a moderate positive correlation of 0.65 with SAT benchmark percentage. We will take a closer look on the pattern.
# Plot scatter plot of median household income
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = "median_household_income", y = "act_average_composite_score", color = "#2B5A79", s = 100, alpha = 0.7)
plt.xlabel("Median household income")
plt.ylabel("ACT average composite score")
plt.title("Do school districts with higher median household income perform better?");
The scatter plot shows a positive relationship between a school district's median household income and its ACT average composite score. As in the previous case, we will visualize the data again after grouping the income into bins.
Let us look at income in terms of percentiles, grouping districts into quartiles of median household income.
# percentile info for rows with SAT benchmark percentage info
school_district_merge_subset_sat_benchmark = school_district_merge_subset[~school_district_merge_subset['sat_percentage_both_benchmark'].isnull()]
school_district_merge_subset_sat_benchmark['median_household_income'].describe()
count       394.000000
mean      76005.375635
std       28728.683261
min       22250.000000
25%       55971.750000
50%       70379.000000
75%       91057.250000
max      224205.000000
Name: median_household_income, dtype: float64
# Create 4 "bins" according to quantile - <25%, 25-50%, 50-75%, >75%
def hh_income_bins(row):
if row["median_household_income"] < school_district_merge_subset['median_household_income'].quantile(0.25):
row["household_income_level"] = "below 25 percentile"
elif row["median_household_income"] >= school_district_merge_subset['median_household_income'].quantile(0.25) and row["median_household_income"] < school_district_merge_subset['median_household_income'].quantile(0.5):
row["household_income_level"] = "25-50 percentile"
elif row["median_household_income"] >= school_district_merge_subset['median_household_income'].quantile(0.5) and row["median_household_income"] < school_district_merge_subset['median_household_income'].quantile(0.75):
row["household_income_level"] = "50-75 percentile"
else:
row["household_income_level"] = "above 75 percentile"
return row
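As a side note, the same quartile binning can be expressed in one vectorized `pd.qcut` call instead of a row-wise `apply`; a sketch on dummy incomes (the real call would use the `median_household_income` column):

```python
import pandas as pd

# Dummy incomes standing in for median_household_income
incomes = pd.Series([30000, 55000, 70000, 95000, 120000, 45000, 80000, 60000])

labels = ["below 25 percentile", "25-50 percentile",
          "50-75 percentile", "above 75 percentile"]

# qcut computes the 25th/50th/75th percentile cut points internally
household_income_level = pd.qcut(incomes, q=4, labels=labels)
```

Note that `pd.qcut` bins relative to the series it receives; to reproduce the function above exactly (quantiles taken from the full subset, bins applied to the SAT sub-frame), pass the subset's quantiles as explicit bin edges to `pd.cut` instead.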
# Plot boxplot
plt.figure(figsize = (8,8))
sns.boxplot(data = school_district_merge_subset_sat_benchmark.apply(hh_income_bins, axis = 1), x = "household_income_level", y = "sat_percentage_both_benchmark",
order = ["below 25 percentile", "25-50 percentile", "50-75 percentile", "above 75 percentile"], palette = "crest")
plt.xticks(rotation = 45);
plt.xlabel("Median household income")
plt.ylabel("SAT percentage above benchmark");
plt.title("Do school districts with higher median household income perform better?");
Here we are dealt the cold hard truth that money does seem to buy results, to the extent that students from wealthier districts do, in general, perform better than their less wealthy counterparts. The income effect is especially pronounced above the 75th percentile of median household income, where the 25th percentile of the SAT benchmark percentage is almost as high as the 75th percentile of districts one income bracket lower.
In case you ask whether schools in districts with wealthier families also spend more on students, the answer is... not really. Quite the opposite: poorer districts can actually spend more in terms of school expenditure per student.
So a family's financial background does play a large role in students' test performance. This can be a problem, because it further reduces social mobility as the advantage snowballs. All the more important, then, that we dedicate resources to creating a level playing field for students!
# Plot scatter plot of median household income vs expenditure per student
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = "median_household_income", y = "expenditure_per_student", color = "#2B5A79", s = 100, alpha = 0.7)
plt.xlabel("Median household income")
plt.ylabel("Expenditure per student")
plt.title("Do school in districts with higher median household income also spend more in terms of expenditure per student?");
Furthermore, as we see in many parts of the world, income disparity is often correlated with a series of deeper social divisions and inequality. To improve academic results at its root, we will need to identify these factors so as to provide more targeted recommendations. We shall explore these further.
#sns.heatmap(corr.loc['median_household_income',corr_subset.index])
plt.figure(figsize = (20,2))
sns.heatmap(pd.DataFrame([corr.loc[corr_subset.index, 'median_household_income'].T, corr.loc[corr_subset.index, 'sat_percentage_both_benchmark'].T]),
square = True, annot = True, cmap = "RdBu")
plt.yticks(rotation = 0);
We will first explore the disparity in test results among ethnic groups.
# First we will perform a melt on the races columns so that the values are on a single column.
# This is in preparation for the plotting
# We will use the ACT scores for this one since they represent actual scores (and are thus more granular) instead of the benchmark percentage used for the SAT
races_col = ['white', 'black', 'hispanic_or_latino', 'asian','american_indian/alaskan_native', 'hawaiian_and_other_pacific_islander',
'some_other_race_alone', 'two_or_more_races', 'median_household_income',
"act_average_composite_score", "act_average_reading_score", "act_average_english_score", "act_average_math_score", "act_average_science_score"]
races = [race.replace("_", " ") for race in races_col][0:-1]
data_races = pd.melt(school_district_merge_subset[~school_district_merge_subset["act_average_composite_score"].isnull()][races_col],
["act_average_composite_score", 'act_average_reading_score', 'act_average_english_score', 'act_average_math_score', 'act_average_science_score', 'median_household_income'],
var_name = 'races', value_name = 'percentage')
data_races.head()
| | act_average_composite_score | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | median_household_income | races | percentage |
|---|---|---|---|---|---|---|---|---|
| 0 | 26.00 | 27.0 | 26.0 | 26.0 | 25.0 | 104756.0 | white | 0.43 |
| 1 | 27.00 | 28.0 | 27.0 | 27.0 | 26.0 | 95783.0 | white | 0.44 |
| 2 | 25.00 | 26.0 | 25.0 | 25.0 | 24.0 | 85444.0 | white | 0.53 |
| 3 | 25.25 | 26.0 | 25.0 | 25.0 | 25.0 | 114277.0 | white | 0.47 |
| 4 | 29.50 | 29.0 | 30.0 | 30.0 | 29.0 | 133339.0 | white | 0.20 |
# Plotting the percentage of different races in a district against the average ACT composite score
fig, ax = plt.subplots(figsize = (22,12))
composite_line = sns.lineplot(data = data_races, x = "act_average_composite_score", y = "percentage", hue = 'races', linewidth = 3,
alpha = 0.8, palette = "colorblind", err_style=None, ax = ax)
# averages ethnic group concentrations
# White
composite_line.axhline(school_district_merge_subset['white'].mean(), ls = '--', c = 'steelblue', alpha = 0.7)
plt.text(14.1, school_district_merge_subset['white'].mean()+0.01,'State average: White')
# Black
composite_line.axhline(school_district_merge_subset['black'].mean(), ls = '--', c = 'orange', alpha = 0.5)
plt.text(14.1, school_district_merge_subset['black'].mean()+0.01,'State average: Black')
# Hispanic/Latino
composite_line.axhline(school_district_merge_subset['hispanic_or_latino'].mean(), ls = '--', c = 'teal', alpha = 0.5)
plt.text(14.1, school_district_merge_subset['hispanic_or_latino'].mean()+0.01,'State average: Hispanic/Latino')
# Asian
composite_line.axhline(school_district_merge_subset['asian'].mean(), ls = '--', c = 'orangered', alpha = 0.7)
plt.text(14.1, school_district_merge_subset['asian'].mean()+0.01,'State average: Asian')
plt.xlabel('ACT average composite score')
plt.ylabel('Percentage of each ethnic group in a school district');
plt.legend()
<matplotlib.legend.Legend at 0x169afec0c10>
We see a few clear trends in the plot, and taken together they suggest that there is inequity in educational achievement along ethnic lines.
If we take a look at the previous scatter plot of SAT benchmark against median household income, but with additional info for the majority race, we also see how educational and income inequity along ethnic lines are actually intertwined.
# Plot scatter plot of median household income vs SAT benchmark percentage
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = "median_household_income", y = "sat_percentage_both_benchmark",
s = 50, palette = "colorblind",hue = 'majority_race', alpha = 0.7)
plt.xlabel("Median household income")
plt.ylabel("SAT percentage above benchmark")
plt.title("Do school districts with higher median household income perform better?");
# Visualize median household income with different races
sns.pairplot(data = school_district_merge_subset,
x_vars=["white", "hispanic_or_latino", "asian", "black"],
y_vars=["median_household_income"],
diag_kind=None, height=4, kind = 'reg',
plot_kws = {"ci":None, 'scatter_kws': {"color":'#2B5A79', 'alpha' : 0.3}})
<seaborn.axisgrid.PairGrid at 0x169ae847fa0>
Let us sidetrack a little to see if the popular notion that Asians are better at math and science holds true.
# Plotting the percentage of different races in a district against the average ACT math score
plt.figure(figsize = (24,9))
math_line = sns.lineplot(data = data_races, x = "act_average_math_score", y = "percentage", hue = 'races', linewidth = 3,
alpha = 0.8, palette = "colorblind", err_style=None)
# averages ethnic group concentrations
# White
math_line.axhline(school_district_merge_subset['white'].mean(), ls = '--', c = 'steelblue', alpha = 0.7)
plt.text(14.5, school_district_merge_subset['white'].mean()+0.01,'State average: White')
# Black
math_line.axhline(school_district_merge_subset['black'].mean(), ls = '--', c = 'orange', alpha = 0.5)
plt.text(14.5, school_district_merge_subset['black'].mean()+0.01,'State average: Black')
# Hispanic/Latino
math_line.axhline(school_district_merge_subset['hispanic_or_latino'].mean(), ls = '--', c = 'teal', alpha = 0.5)
plt.text(14.5, school_district_merge_subset['hispanic_or_latino'].mean()+0.01,'State average: Hispanic/Latino')
# Asian
math_line.axhline(school_district_merge_subset['asian'].mean(), ls = '--', c = 'orangered', alpha = 0.7)
plt.text(14.5, school_district_merge_subset['asian'].mean()+0.01,'State average: Asian')
plt.xlabel('ACT average math score')
plt.ylabel('Percentage of each ethnic group in a school district');
# Plotting the percentage of different races in a district against the average ACT science score
plt.figure(figsize = (24,9))
sns.lineplot(data = data_races, x = "act_average_science_score", y = "percentage", hue = 'races', linewidth = 3,
alpha = 0.8, palette = "colorblind", err_style=None)
plt.xlabel('ACT average science score')
plt.ylabel('Percentage of each ethnic group in a school district');
# Plotting the percentage of different races in a district against the average ACT english score
plt.figure(figsize = (24,9))
sns.lineplot(data = data_races, x = "act_average_english_score", y = "percentage", hue = 'races', linewidth = 3,
alpha = 0.8, palette = "colorblind", err_style=None)
plt.xlabel('ACT average english score')
plt.ylabel('Percentage of each ethnic group in a school district');
Compared with English and science, Asian students do seem to excel in math.
We also see that another factor with a positive correlation above 0.5 with the test results is the percentage of households with broadband internet access. We shall examine this factor more closely.
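Screening for columns that clear such a correlation threshold takes only a couple of lines; a hedged sketch on toy data (the real call would be on `school_district_merge_subset`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy frame: one column tracks the score, one is pure noise
score = rng.normal(20, 3, 200)
df = pd.DataFrame({
    "act_average_composite_score": score,
    "household_with_broadband_internet": 0.04 * score + rng.normal(0, 0.05, 200),
    "unrelated_noise": rng.normal(0, 1, 200),
})

# Correlations of every column with the ACT score, strongest first
corrs = (df.corr()["act_average_composite_score"]
           .drop("act_average_composite_score")
           .sort_values(ascending=False))
strong = corrs[corrs > 0.5]  # columns clearing the 0.5 threshold
```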
# Visualize using histogram
sns.histplot(school_district_merge_subset["household_with_broadband_internet"], color = "#31838C", kde = True, stat = 'probability')
<AxesSubplot:xlabel='household_with_broadband_internet', ylabel='Probability'>
We see a roughly normal distribution with a left skew: the highest concentration is in the 0.8-0.9 range, while the minimum lies in the 0.3-0.4 range. Next we will plot the broadband internet rate against the ACT average composite score.
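The "left skew" reading can be checked numerically with pandas' `.skew()`; a sketch on simulated data (a Beta(8, 2) sample stands in for `household_with_broadband_internet`, since it also concentrates near 0.8 with a left tail):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Beta(8, 2) concentrates near 0.8 with a long left tail,
# mimicking the shape of the broadband column
sample = pd.Series(rng.beta(8, 2, 1000))

skewness = sample.skew()  # negative value => left-skewed
```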
# Visualize percentage broadband access and ACT score
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = 'household_with_broadband_internet', y = 'act_average_composite_score',
ec = '#154247', fc = "#31838C", alpha = 0.4)
plt.xlabel("Percentage of household with broadband internet")
plt.ylabel("ACT average composite score")
Text(0, 0.5, 'ACT average composite score')
# Visualize percentage broadband access and ACT score, with bubble size showing median household income
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = 'household_with_broadband_internet', y = 'act_average_composite_score',
size = 'median_household_income', sizes = (10,500), ec = '#154247', fc = "#31838C", alpha = 0.4)
plt.xlabel("Percentage of household with broadband internet")
plt.ylabel("ACT average composite score")
legend = plt.legend(loc = 'upper left');
legend.set_title("Median household income")
# Create 4 "bins" according to quantile - <25%, 25-50%, 50-75%, >75%
def broadband_bins(row):
if row["household_with_broadband_internet"] < school_district_merge_subset['household_with_broadband_internet'].quantile(0.25):
row["household_with_broadband_internet_percentile"] = 1 + np.random.uniform(-0.3,0.3)
elif row["household_with_broadband_internet"] >= school_district_merge_subset['household_with_broadband_internet'].quantile(0.25) and row["household_with_broadband_internet"] < school_district_merge_subset['household_with_broadband_internet'].quantile(0.5):
row["household_with_broadband_internet_percentile"] = 2+ np.random.uniform(-0.3,0.3)
elif row["household_with_broadband_internet"] >= school_district_merge_subset['household_with_broadband_internet'].quantile(0.5) and row["household_with_broadband_internet"] < school_district_merge_subset['household_with_broadband_internet'].quantile(0.75):
row["household_with_broadband_internet_percentile"] = 3+ np.random.uniform(-0.3,0.3)
else:
row["household_with_broadband_internet_percentile"] = 4+ np.random.uniform(-0.3,0.3)
return row
school_district_merge_subset.apply(broadband_bins, axis =1)
# Visualize percentage broadband access and ACT score
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset.apply(broadband_bins, axis =1), x = 'household_with_broadband_internet_percentile', y = 'act_average_composite_score',
size = 'median_household_income', sizes = (1, 500), ec = '#154247', fc = "#31838C", alpha = 0.4)
plt.xlabel("Percentage of household with broadband internet")
plt.ylabel("ACT average composite score")
plt.xticks(ticks = [1,2,3,4], labels = ["0-25th percentile", "25-50th percentile", "50-75th percentile", "75-100th percentile"])
legend = plt.legend(loc = 'best');
legend.set_title("Median household income")
We see quite a clear positive relationship between the percentage of households with broadband internet and a district's ACT score. By also visualizing the median household income of each district as the bubble size, we see that districts where fewer households have internet access are typically also those with lower median household income.
In the information age, internet access is almost equivalent to access to information. Raising the income of an area is a long-term process, but in the short term we could provide subsidized internet service to lower-income households, giving them the access to knowledge that is essential to improving standardized test results.
school_district_merge_subset.head(0)
| NCES_ID | CDCode | county_name | district_name | district_type | urban_locale | total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | majority_race | white_majority | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | act_average_composite_score_per_dollar_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
# First we will perform a melt on the household type columns so that the values are in a single column.
# This is in preparation for the plotting
# As before, we use the ACT scores since they represent actual scores (and are thus more granular) instead of the SAT benchmark percentage
hh_type_col = ['married_couple_household', 'cohabitating_couple_household', 'female_householder_household', 'male_householder_household', 'median_household_income',
"act_average_composite_score", "act_average_reading_score", "act_average_english_score", "act_average_math_score", "act_average_science_score"]
data_hh_type = pd.melt(school_district_merge_subset[~school_district_merge_subset["act_average_composite_score"].isnull()][hh_type_col],
["act_average_composite_score", 'act_average_reading_score', 'act_average_english_score', 'act_average_math_score', 'act_average_science_score', 'median_household_income'],
var_name = 'household_type', value_name = 'percentage')
data_hh_type.tail()
| | act_average_composite_score | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | median_household_income | household_type | percentage |
|---|---|---|---|---|---|---|---|---|
| 1351 | 27.50 | 28.0 | 27.0 | 28.0 | 27.0 | 67918.0 | male_householder_household | 0.05 |
| 1352 | 20.00 | 21.0 | 19.0 | 20.0 | 20.0 | 70699.0 | male_householder_household | 0.07 |
| 1353 | 19.50 | 20.0 | 19.0 | 19.0 | 20.0 | 68687.0 | male_householder_household | 0.06 |
| 1354 | 19.00 | 20.0 | 18.0 | 19.0 | 19.0 | 51851.0 | male_householder_household | 0.06 |
| 1355 | 21.75 | 23.0 | 21.0 | 21.0 | 22.0 | 88694.0 | male_householder_household | 0.03 |
# Plotting the percentage of different household types in a district against the average ACT composite score
fig, ax = plt.subplots(figsize = (22,12))
composite_line = sns.lineplot(data = data_hh_type, x = "act_average_composite_score", y = "percentage", hue = 'household_type', linewidth = 3,
alpha = 0.8, palette = "colorblind", err_style=None, estimator = 'mean')
# averages of household types
# married_couple_household
composite_line.axhline(school_district_merge_subset['married_couple_household'].mean(), ls = '--', c = 'steelblue', alpha = 0.7)
plt.text(14.1, school_district_merge_subset['married_couple_household'].mean()+0.01,'State average: married couple household')
# cohabitating_couple_household
composite_line.axhline(school_district_merge_subset['cohabitating_couple_household'].mean(), ls = '--', c = 'orange', alpha = 0.5)
plt.text(14.1, school_district_merge_subset['cohabitating_couple_household'].mean()+0.01,'State average: cohabitating couple household')
# female_householder_household
composite_line.axhline(school_district_merge_subset['female_householder_household'].mean(), ls = '--', c = 'teal', alpha = 0.5)
plt.text(14.1, school_district_merge_subset['female_householder_household'].mean()+0.01,'State average: female householder household')
# male_householder_household
composite_line.axhline(school_district_merge_subset['male_householder_household'].mean(), ls = '--', c = 'orangered', alpha = 0.7)
plt.text(14.1, school_district_merge_subset['male_householder_household'].mean()+0.01,'State average: male householder household')
plt.title("Do certain household types do better in standardized tests?")
plt.xlabel('ACT average composite score')
plt.ylabel('Percentage of each household type in the school district');
plt.legend()
<matplotlib.legend.Legend at 0x169af990610>
The education level of parents should be correlated with median household income and also with broadband access. Let's visualize the broadband access graph again, this time with household income and parents' education level included.
# Visualize percentage broadband access and ACT score, with hue showing parents' education level
plt.figure(figsize = (12,8))
sns.scatterplot(data = school_district_merge_subset, x = 'household_with_broadband_internet', y = 'act_average_composite_score',
size = 'median_household_income', sizes = (10,500), alpha = 0.9, hue = "bachelors_or_higher")
plt.xlabel("Percentage of household with broadband internet")
plt.ylabel("ACT average composite score")
legend = plt.legend(loc = 'upper left');
By visualizing the same graph with this additional information, we can see how the three variables are interrelated, and how they relate to the ACT score.
We could further investigate why the children of parents with higher education levels perform differently; perhaps things like better academic and career advice would help, but that is a topic for another day.
We will also visualize participation rates and benchmark results across the geography of the school districts, to see whether there are obvious geographical patterns influencing the test scores.
# Read shp file of california school districts
calfornia_gdf = gpd.read_file("../data/GIS/DistrictAreas1819.shp")
# concat with attributes from the full pandas dataframe containing the test and socio-demographic info
calfornia_gdf = pd.concat([calfornia_gdf[["DistrictNa", "geometry"]], school_district_merge_full], axis = 1).drop("DistrictNa", axis = 1)
calfornia_gdf.tail(2)
| | geometry | NCES_ID | CDCode | county_name | district_name | district_type | urban_locale | total_enrolment | charter_school_percentage | non_charter_school_percentage | homeless_student_percentage | migrant_student_percentage | dropout_percentage | suspension_percentage | Unnamed: 0 | total_population | median_household_income | total_household | white | black | hispanic_or_latino | asian | american_indian/alaskan_native | hawaiian_and_other_pacific_islander | some_other_race_alone | two_or_more_races | housing_structure_built_2000_and_after | housing_structure_built_1970-1999 | housing_structure_built_before_1970 | household_with_broadband_internet | housing_structure_type_house | housing_structure_type_apartment | speak_english_only_children | under18_with_disability | under18_with_health_insurance | family_income_below_poverty | married_couple_household | cohabitating_couple_household | female_householder_household | male_householder_household | parents_not_in_labor_force | bachelors_or_higher | expenditure_per_student | majority_race | white_majority | act_enroll | act_num_test_taker | act_participation_rate | act_average_reading_score | act_average_english_score | act_average_math_score | act_average_science_score | act_num_above_average_score | act_percentage_above_average_score | act_average_composite_score | sat_enroll | sat_num_test_taker | sat_participation_rate | sat_num_erw_benchmark | sat_percentage_erw_benchmark | sat_num_math_benchmark | sat_percentage_math_benchmark | sat_num_both_benchmark | sat_percentage_both_benchmark | act_average_composite_score_per_dollar_expenditure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 942 | POLYGON ((-13500775.523 4754748.024, -13500775... | 642330 | 5872751 | Yuba | Wheatland | Elementary | 31 - Town, Fringe | 1388 | 0.070605 | 0.929395 | 0.012 | 0.001 | NaN | 0.029 | 942 | 7563.0 | 72473.0 | 2525.0 | 0.72 | 0.02 | 0.15 | 0.03 | 0.01 | 0.01 | 0.0 | 0.06 | 0.342 | 0.427 | 0.232 | 0.861 | 0.851 | 0.149 | 0.927 | 0.051 | 0.961 | 0.059 | 0.74 | 0.07 | 0.12 | 0.07 | 0.227 | 0.212 | 13336.0 | white | yes | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 943 | POLYGON ((-13503989.762 4753344.067, -13503964... | 642350 | 5872769 | Yuba | Wheatland Union High | High | 41 - Rural, Fringe | 784 | 0.000000 | 1.000000 | 0.009 | 0.000 | 0.15 | 0.073 | 943 | 15588.0 | 88694.0 | 4916.0 | 0.65 | 0.03 | 0.19 | 0.05 | 0.01 | 0.01 | 0.0 | 0.07 | 0.583 | 0.280 | 0.138 | 0.901 | 0.916 | 0.084 | 0.908 | 0.013 | 0.914 | 0.083 | 0.74 | 0.09 | 0.14 | 0.03 | 0.282 | 0.267 | 15561.0 | white | yes | 167.0 | 18.0 | 0.107784 | 23.0 | 21.0 | 21.0 | 22.0 | 10.0 | 0.5556 | 21.75 | 381.0 | 115.0 | 0.301837 | 93.0 | 0.808696 | 63.0 | 0.547826 | 62.0 | 0.53913 | 0.001398 |
#calfornia_gdf.to_csv("../data/merged_df.csv")
# Import major california cities
import fiona
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
major_cal_cities_gdf = gpd.read_file('../data/GIS/california_major_cities.kml', driver='KML')
major_cal_cities_gdf.to_crs(crs = "EPSG:3857", inplace = True)
major_cal_cities_gdf.crs
<Derived Projected CRS: EPSG:3857>
Name: WGS 84 / Pseudo-Mercator
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: World between 85.06°S and 85.06°N.
- bounds: (-180.0, -85.06, 180.0, 85.06)
Coordinate Operation:
- name: Popular Visualisation Pseudo-Mercator
- method: Popular Visualisation Pseudo Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
We will visualize the geographical distribution of urbanisation in California.
# Visualizing on map
plt.figure(figsize = (20,12))
plt.suptitle("Do districts with higher urbanisation also get better ACT results?", fontsize = 20);
## Plot map
# Plot choropleth map
ax1 = plt.subplot(121)
calfornia_gdf.plot("urban_locale", legend = True, k = 10,cmap = "RdBu", alpha = 0.4, ax = ax1, legend_kwds={'title': "Urban Locale"})
plt.axis('off');
# plot centroids of each district
centroids = calfornia_gdf.copy()
centroids.geometry = calfornia_gdf.centroid
centroids['size'] = centroids['act_average_composite_score'] * 30
centroids.plot(column = 'act_average_composite_score', markersize = 30, ax = ax1, alpha =0.9, cmap = 'viridis', ec = 'grey',
legend = True, legend_kwds={'label':"ACT average composite score", 'shrink' :0.5})
## Plot right strip plot
ax2 = plt.subplot(222)
sns.stripplot(data = school_district_merge_subset, x = "urban_locale", y = "act_average_composite_score",
palette = "RdBu", s = 8, ax = ax2)
plt.xticks(rotation = 45);
plt.ylabel("ACT average composite score")
plt.xlabel("Urban locale of school district");
# plot bottom right boxplot
def urban_bins(row):
if 'City' in row["urban_locale"]:
row["urban_local_broad"] = "City"
elif 'Suburban' in row["urban_locale"]:
row["urban_local_broad"] = "Suburban"
elif 'Town' in row["urban_locale"]:
row["urban_local_broad"] = "Town"
else:
row["urban_local_broad"] = "Rural"
return row
## Plot boxplot for coarser-grained urban locale
ax3 = plt.subplot(224)
sns.boxplot(data = school_district_merge_subset.apply(urban_bins, axis = 1), x = "urban_local_broad", y = "act_average_composite_score",
palette = "RdBu", order = ['City', 'Suburban', 'Town', 'Rural'], ax = ax3)
plt.xticks(rotation = 45);
plt.ylabel("ACT average composite score")
plt.xlabel("Urban locale of school district - 4 Categories");
plt.tight_layout()
#major_cal_cities_gdf.plot(ax = ax1, color = "green", markersize = 50, alpha = 0.5, edgecolor = 'black', linewidth = 3)
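The `urban_bins` lookup above can also be written as a vectorized `np.select`; a sketch on hypothetical locale strings in the dataset's "code - label" format (only "31 - Town, Fringe" and "41 - Rural, Fringe" appear verbatim in the rows shown earlier):

```python
import numpy as np
import pandas as pd

# Hypothetical locale strings in the dataset's "code - label" format
locales = pd.Series(["11 - City, Large", "21 - Suburban, Large",
                     "31 - Town, Fringe", "41 - Rural, Fringe"])

conditions = [locales.str.contains("City"),
              locales.str.contains("Suburban"),
              locales.str.contains("Town")]
choices = ["City", "Suburban", "Town"]

# Anything matching no pattern falls back to "Rural",
# mirroring the else branch of urban_bins
urban_local_broad = np.select(conditions, choices, default="Rural")
```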
It seems that many of the city and suburban districts are concentrated along the western coast; in fact, the two coastal clusters are San Francisco and Los Angeles.
Upon examination, these districts are more likely to obtain higher ACT scores. In general, districts in cities and suburbs are more likely to obtain higher ACT scores than towns and rural areas, and coastal cities are more likely to obtain higher scores than inland cities.
And of course, there are income disparities between districts as well.
# Plot boxplot
sns.catplot(data = school_district_merge_subset, x = "urban_locale", y = "median_household_income", palette = "RdBu",
height = 4, aspect = 2)
plt.xticks(rotation = 90);
plt.xlabel("Urban locale of school district")
plt.ylabel("Median household income");
plt.title("Do districts with higher urbanisation also have higher median household income?");
We will also visualize the geographical distribution of the SAT participation rate. Similar to the ACT score, the SAT participation rate also tends to be higher in coastal cities.
# Visualizing SAT participation rate
fig, ax = plt.subplots(figsize = (20,12))
plt.title("Do districts with higher urbanisation also have higher SAT participation rate?");
## Plot map
# Plot choropleth map
calfornia_gdf.plot("urban_locale", legend = True, k = 10,cmap = "RdBu", alpha = 0.4,
legend_kwds={'title': "Urban Locale"}, ax = ax)
plt.axis('off');
# plot centroids of each district
centroids = calfornia_gdf.copy()
centroids.geometry = calfornia_gdf.centroid
centroids['size'] = centroids['sat_participation_rate'] * 30
centroids.plot(column = 'sat_participation_rate', markersize = 30,alpha =0.9, cmap = 'viridis', ec = 'grey',
legend = True, legend_kwds={'label':"SAT participation rate", 'shrink' :0.5}, ax = ax, vmax = 0.7);
Key Takeaways and Recommendations
- Increasing education expenditure per student may only improve standardized test scores up to a point: above 10,000 USD per student, the effect becomes less apparent.
- The experience of charter schools, which achieve better results per dollar spent, further reinforces the notion that expenditure has limited effectiveness and that education expenditure has diminishing marginal returns.
- The median household income of a district is more highly correlated with standardized test results; support should focus on lower-income households in order to create a level playing field for students.
- We have also identified a few areas with a disparity both in income and in standardized test results.

The analysis gave us a glimpse into the socio-economic divisions underlying the disparity in test scores. As university admission is still an important pathway to social mobility, it is crucial that these underlying divisions are addressed, so that standardized tests do not perpetuate the existing divisions in society.